Storage of Data In A Distributed Storage System

ABSTRACT

A distributed storage system stores data for files. A first blob (binary large object) of data is received. The first blob is split into one or more first chunks of data. Content fingerprints for the first chunks of data are computed. The first chunks of data are stored in a chunk store, and their content fingerprints are stored in a store distinct from the chunk store. A second blob of data is received. The second blob is split into one or more second chunks of data. Content fingerprints for the second chunks of data are computed. Then, for a second chunk of data whose content fingerprint matches a content fingerprint of a first chunk of data, a second reference to the corresponding first chunk of data that has a matching content fingerprint is stored, but the second chunk of data itself is not stored.

PRIORITY

This application claims priority to U.S. Provisional Application Ser. No. 61/302,930, filed Feb. 9, 2010, entitled “Storage of Data in a Planet-Scale Distributed Storage System”, which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

The disclosed embodiments relate generally to distributed storage systems, and more specifically to storage of blobs in large-scale distributed storage systems.

BACKGROUND

User applications are commonly delivered to end users with web-based interfaces. These applications are available to millions of users all over the world, and require a substantial amount of space for data storage. For example, the Gmail™ application is used by many millions of users, and requires storage space for each user's email. Such applications impose several constraints on the storage system, and prior art systems do not satisfactorily meet these constraints.

One desirable property of a storage system is that it be both large and scalable. Even if a storage system could handle current storage needs, many systems will not scale to meet the growing needs.

Another desirable property of a storage system is that the data stored is near the end user so that reading and writing data is fast. A single centralized storage facility at one location or a small number of locations does not meet the needs of users throughout the world because some users have to read and write data over slow network links.

Another desirable property of a storage system is that the data be reliably backed up, so that it can recover from both natural and human errors. Many storage systems do not maintain multiple copies of data, so that recovery could require retrieval from tape backup, taking a very long time.

Another desirable property of a storage system is that network and data center failures should be transparent to end users. In most systems, if a network link or data center goes down, some users will not be able to access their own data while the failure is resolved or a temporary workaround is manually implemented.

SUMMARY

The above deficiencies and other problems associated with existing distributed storage systems are addressed by the disclosed embodiments. Some of the disclosed embodiments implement distributed storage systems with instances located throughout the world. Replicas of data blobs are distributed throughout the storage system, with new blobs created near the relevant users. Based on both usage and policy, copies of blobs are transmitted to other instances, which optimizes storage space based on the actual needs of the end users. The architecture of the disclosed distributed storage system embodiments facilitates growth, both within individual instances and through the addition of new instances. Moreover, in the disclosed architecture, various portions of the data are effectively “backed up” by other copies of the data elsewhere within the distributed storage system. In addition, the disclosed architecture facilitates locating data near where it is used, so that users everywhere have relatively fast access.

In accordance with some embodiments, a distributed storage system for storing electronic data comprises instances, which may be local instances or global instances. The system has a plurality of local instances, and at least a subset of the local instances are at physically distinct geographic locations. Each local instance includes a plurality of server computers, each having memory and one or more processors. Each respective local instance is configured to store data for a respective non-empty set of blobs in a plurality of data stores having a plurality of distinct data store types, and to store metadata for the respective set of blobs in a metadata store distinct from the data stores. The system has a plurality of global instances. Each global instance includes a plurality of server computers, each having memory and one or more processors. Each global instance is configured to store data for zero or more blobs in zero or more data stores and to store metadata for all blobs stored at any local or global instance. One global instance has a background replication module that replicates blobs between instances according to blob policies.

In accordance with some embodiments, a distributed storage system for storing electronic data comprises instances, which may be local instances or global instances. The system has a plurality of local instances, and at least a subset of the local instances are at physically distinct geographic locations. Each local instance includes a plurality of server computers, each having memory and one or more processors. Each respective local instance is configured to store data for a respective non-empty set of blobs in a plurality of data stores having a plurality of distinct data store types, and to store metadata for the respective set of blobs in a metadata store distinct from the data stores. The system has a plurality of global instances. Each global instance includes a plurality of server computers, each having memory and one or more processors. Each global instance is configured to store data for zero or more blobs in zero or more data stores and to store metadata for all blobs stored at any local or global instance. Each local or global instance has a dynamic replication module that dynamically replicates blobs from one local or global instance to another local or global instance based on user requests to access blobs that are not stored at a local or global instance near the user.

In accordance with some embodiments, a distributed storage system for storing electronic data comprises a plurality of instances. Each instance includes a plurality of server computers having memory and one or more processors. At least a subset of the instances are at physically distinct geographic locations. Each instance stores data for a plurality of blobs. Each blob has an associated blob policy that specifies the desired number of copies of the blob as well as the desired locations for copies of the blob. The system includes a location assignment module configured to compare the desired number of copies of each blob and desired location constraints for each blob to a current number of copies of each blob and current locations of copies of each blob. The location assignment module is also configured to issue commands to delete a copy of a respective blob or to replicate a respective blob to another instance when the current number of copies of a respective blob and/or current locations of the respective blob are inconsistent with the desired number of copies of the respective blob or the desired location constraints of the respective blob.

In accordance with some embodiments, a computer-implemented method of utilizing a tape system for data storage executes at one or more server computers, each having one or more processors and memory. The memory stores one or more programs for execution by the one or more processors on each server computer. The method receives a request to store a blob of data in a tape store, and the request includes the content of the blob. The method writes the content of the blob to a first tape store buffer. Then, when a predefined condition is met, the method writes the content from the first tape store buffer to a tape. In some embodiments, the predefined condition is that the first tape store buffer fills to a first threshold percentage of capacity. In some embodiments, the predefined condition is that a predefined length of time has passed since a last time content was written from the first tape store buffer to a tape. Other embodiments have a predefined condition that is a combination of these two conditions. The method later receives a request from a client to read the blob of data from the tape store. When read requests reach a second threshold, the method reads the contents of the blob from tape, and writes the contents of the blob to a second tape store buffer. The method sends a message to the client indicating that the blob contents are available for reading.
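For concreteness, the following sketch models this two-phase tape store in memory. It is illustrative only: the class and method names, the 80% fill threshold, the 30-second flush timer, and the batch of four read requests standing in for the “second threshold” are all assumptions, since the method above leaves these parameters unspecified.

```python
import time

class TapeStore:
    """Two-phase tape store: writes and reads are staged through buffers."""

    def __init__(self, capacity=1 << 20, fill_threshold=0.8,
                 max_age_s=30.0, read_batch=4):
        self.write_buffer = []             # the "first tape store buffer"
        self.buffered_bytes = 0
        self.capacity = capacity
        self.fill_threshold = fill_threshold
        self.max_age_s = max_age_s
        self.last_flush = time.monotonic()
        self.tape = {}                     # stands in for the physical tape
        self.read_buffer = {}              # the "second tape store buffer"
        self.pending_reads = []
        self.read_batch = read_batch       # stands in for the "second threshold"

    def write_blob(self, blob_id, content):
        self.write_buffer.append((blob_id, content))
        self.buffered_bytes += len(content)
        # Predefined condition: buffer filled to a threshold percentage of
        # capacity, or a predefined time elapsed since the last write to tape.
        full = self.buffered_bytes >= self.fill_threshold * self.capacity
        stale = time.monotonic() - self.last_flush >= self.max_age_s
        if full or stale:
            self._flush_to_tape()

    def _flush_to_tape(self):
        for blob_id, content in self.write_buffer:
            self.tape[blob_id] = content
        self.write_buffer.clear()
        self.buffered_bytes = 0
        self.last_flush = time.monotonic()

    def request_read(self, blob_id):
        """Stage reads in batches from tape into the read buffer."""
        self.pending_reads.append(blob_id)
        if len(self.pending_reads) < self.read_batch:
            return []                      # not enough requests queued yet
        self._flush_to_tape()              # ensure pending writes reached tape
        messages = []
        for bid in self.pending_reads:
            self.read_buffer[bid] = self.tape[bid]
            messages.append(f"blob {bid} is available for reading")
        self.pending_reads.clear()
        return messages
```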

In accordance with some embodiments, a computer-implemented method of storing data for files executes at one or more server computers, each having one or more processors and memory. The memory stores one or more programs for execution by the one or more processors on each server computer. The method receives a first blob of data, and splits the first blob of data into one or more first chunks of data. The method computes a content fingerprint for each of the first chunks of data. The method stores the first chunks of data in a chunk store and stores the content fingerprints of the first chunks of data in a store distinct from the chunk store. The method also receives a second blob of data, and splits the second blob of data into one or more second chunks of data. The method computes a content fingerprint for each of the second chunks of data. For each second chunk of data whose content fingerprint matches a content fingerprint of a first chunk of data, the method stores a second reference to the corresponding first chunk of data that has a matching content fingerprint and does not store the second chunk of data itself. For each second chunk of data whose content fingerprint does not match a content fingerprint of a first chunk of data, the method stores the second chunk of data in a chunk store.
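The following sketch illustrates this de-duplication flow, assuming SHA-256 as the content fingerprint and a fixed 1 MiB split; the method above names neither a fingerprint function nor a chunking rule, and the in-memory dictionaries merely stand in for the chunk store and the separate fingerprint store.

```python
import hashlib

CHUNK_SIZE = 1 << 20            # assumed 1 MiB split; the text leaves this open

chunk_store = {}                # fingerprint -> chunk contents
fingerprint_store = {}          # kept "distinct from the chunk store"
blob_references = {}            # blob_id -> ordered fingerprint list

def store_blob(blob_id: str, data: bytes) -> None:
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in fingerprint_store:
            fingerprint_store[fp] = True
            chunk_store[fp] = chunk    # new content: store the chunk itself
        refs.append(fp)                # matching content: store a reference only
    blob_references[blob_id] = refs

store_blob("blob-1", b"A" * 3_000_000)
store_blob("blob-2", b"A" * 3_000_000)  # identical content re-uses blob-1's chunks
print(len(chunk_store))                 # 2 distinct chunks stored, not 6
```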

In accordance with some embodiments, a computer-implemented method of storing data for files executes at one or more server computers, each having one or more processors and memory. The memory stores one or more programs for execution by the one or more processors on each server computer. The method receives a first representation of a blob of data having a specified first representation type, and stores the first representation of the blob of data. The method also stores metadata for the blob of data, including a name of the blob, the representation type, and a storage location for the first representation of the blob. The method also receives a request to create a second representation of the blob with a second representation type, and creates a second representation of the blob having the second representation type. The method stores the second representation of the blob of data and updates the metadata for the blob of data to indicate the presence of the second representation of the blob with the second representation type. The method receives a request from a client for a copy of the blob, and the request includes a specified representation type. The method retrieves either the first representation of the blob or the second representation of the blob, the retrieved representation of the blob corresponding to the representation type requested by the client. The method sends the retrieved representation of the blob to the client.

In accordance with some embodiments, a computer-implemented method of reading a blob from a distributed storage system executes at a client on a computer having one or more processors and memory. The memory stores one or more programs for execution by the one or more processors on the computer. The method receives a request from a user application for a blob and locates an instance within the distributed storage system that is geographically close to the client. The method contacts a blob access module at the located instance to request metadata for the requested blob. The request includes user access credentials. The method receives from the blob access module a collection of metadata for the requested blob, and a set of one or more read tokens. The method selects an instance that has a copy of the requested blob based on the received collection of metadata and contacts a data store module at the selected instance. The method provides the data store module with the set of one or more read tokens. The method receives the content of the requested blob in one or more chunks and assembles the one or more chunks to form the requested blob. The method returns the blob to the user application.

Thus methods and systems are provided that are scalable, and efficiently use existing storage capacity and network bandwidth. The methods and systems effectively use the distributed resources to place copies of blobs near where they are needed, with additional copies at other locations that can function as real-time backups. Because of intelligent background replication and replication based on immediate end user needs, the disclosed methods and systems provide a system that is reliable, provides quick access for users, and uses the existing storage capacity effectively.

BRIEF DESCRIPTION OF THE DRAWINGS

For a better understanding of the aforementioned embodiments of the invention as well as additional embodiments thereof, reference should be made to the Description of Embodiments below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

FIG. 1A is a conceptual illustration for placing multiple instances of a database at physical sites all over the globe according to some embodiments.

FIG. 1B illustrates basic functionality at each instance according to some embodiments.

FIGS. 1C-1G illustrate ways that a distributed storage system may be integrated with systems that provide user applications according to some embodiments.

FIG. 2 is a block diagram illustrating multiple instances of a replicated database, with an exemplary set of programs and/or processes shown for the first instance according to some embodiments.

FIG. 3 is a block diagram that illustrates an exemplary instance for the system, and illustrates what blocks within the instance a user interacts with according to some embodiments.

FIG. 4 is a block diagram of an instance server that may be used for the various programs and processes illustrated in FIGS. 1B, 2, and 3, according to some embodiments.

FIG. 5 illustrates a typical allocation of instance servers to various programs or processes illustrated in FIGS. 1B, 2, and 3, according to some embodiments.

FIG. 6 illustrates how metadata is stored according to some embodiments.

FIG. 7 illustrates a data structure that is used to store deltas according to some embodiments.

FIG. 8 illustrates an exemplary compaction process according to some embodiments.

FIG. 9 illustrates a sequence of events in the replication process according to some embodiments.

FIG. 10 is a block diagram that illustrates a client computer according to some embodiments.

FIGS. 11A-11C illustrate a method of replicating distributed data according to some embodiments.

FIGS. 12A-12B illustrate a method of compacting data in a distributed database according to some embodiments.

FIG. 13 illustrates a method of reading a piece of data from a distributed database according to some embodiments.

FIGS. 14A-14D illustrate skeletal data structures for egress and ingress maps according to some embodiments.

FIGS. 15A-15B illustrate a process of developing a transmission plan for sending database changes to other instances according to some embodiments.

FIG. 16 provides an example of evaluating the cost of various transmission plans according to some embodiments.

FIG. 17 illustrates a method of determining a compaction horizon using ingress maps according to some embodiments.

FIGS. 18A-18E illustrate data structures used to store metadata according to some embodiments.

FIG. 19 illustrates a method of utilizing a tape device as a data store according to some embodiments.

FIG. 20 illustrates a method of implementing content-based de-duplication according to some embodiments.

FIG. 21 illustrates a method of efficiently creating and utilizing multiple representations of a blob according to some embodiments.

FIG. 22 illustrates a method of reading a blob stored in a distributed storage system according to some embodiments.

FIG. 23 is a block diagram that illustrates a process to reduce the amount of storage using content-based de-duplication according to some embodiments.

FIGS. 24A-24C illustrate an exemplary set of operations to create and retrieve multiple representations of the same blob according to some embodiments.

FIG. 25 is a block diagram that illustrates a process of reading a blob from a distributed storage system according to some embodiments.

FIG. 26 is a block diagram that illustrates a three-layer stable clock system according to some embodiments.

FIG. 27 provides an exemplary list of blob policies, and illustrates the relationship between blobs and blob policies according to some embodiments.

FIG. 28 provides a high-level illustration of how blobs are stored according to some embodiments.

Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that the present invention may be practiced without these specific details.

The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the description of the invention and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof.

DESCRIPTION OF EMBODIMENTS

Purpose

Embodiments of the present invention provide a distributed storage system. In some embodiments, the distributed storage system is global or planet-scale. The term “planet-scale” contrasts the disclosed embodiments with existing machine-scale or data-center-scale storage systems, but does not necessarily require that the elements be located all over the planet. The disclosed embodiments form a single storage system from the perspective of its users, even in an environment with many data centers (sometimes referred to as instances). Planet-scale systems differ from data-center-scale systems primarily in that the network link between two data centers is orders of magnitude slower and of lower capacity than the links within a data center, so data-center-scale techniques do not apply.

Advantages of the disclosed embodiments include functionality that:

-   makes temporary data center unavailability events as invisible as possible to the user. The disclosed embodiments adapt to the unavailability of one data center by directing traffic to other data centers and potentially making additional copies of data at additional data centers. Outages of data centers or certain network links to data centers are fairly common. Because the storage for a single user's data may be spread over a large number of data centers, this creates difficulties for applications that lack a planet-scale storage system.
-   makes decisions about where to store individual pieces of data on its own. This means that a user is insulated from issues related to insufficient capacity being available at any particular data center. The disclosed embodiments will simply spread their data over multiple data centers. This automatic distribution also addresses the case where a data center is unavailable in the long term or even permanently: the disclosed embodiments can easily transfer the data elsewhere, without needing to notify the user.

The disclosed embodiments are designed primarily for immutable or weakly mutable data. “Weakly mutable” means that, when you change an entry, that change will ultimately propagate everywhere, but the time for the propagation is not constrained. This is sometimes referred to as “eventually consistent.” On the other end of the spectrum is “strongly mutable” data. For strongly mutable data, once you have written a change, all future reads are guaranteed to return the newly written value, regardless of where the user or data reside. Many applications only require weak mutability, or no mutability at all, and this can be implemented much more cheaply than strong mutability, so there is an advantage in doing so. The disclosed embodiments primarily address the needs of weakly mutable data, although some of the disclosed methods apply to distributed storage systems in general without regard to whether the underlying data is weakly mutable or strongly mutable.

The disclosed embodiments form a “blob store.” A blob store maps blob names onto arbitrary contents, and the blob store makes no attempt to interpret the contents. In this way, a blob store is conceptually similar to a file system, with a blob name corresponding to a file name.

One feature of the disclosed embodiments is dynamic replication. At any point in time, a blob may have one or more replicas. Replicas may be added on-the-fly in response to demand. This means that blobs that are in high demand can get a large number of replicas (improving latency, availability, and so on) without user intervention, while blobs that are in low demand have less replication and a lower cost for storage.

Another feature of the disclosed embodiments is background replication. Users can specify a replication policy such as “keep two copies on disk and one on tape, in three different metro areas.” The system will monitor blobs in the background, and add or remove replicas in various locations, in order to satisfy this policy. The system that implements this background replication must trade off costs of storage and transit to and from various locations.
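As an informal illustration, a policy of this kind might be represented and checked as below. The schema (counts per storage medium plus a minimum number of metro areas) and all field names are invented for the example; the embodiments do not prescribe a concrete policy format.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Replica:
    instance: str
    metro: str
    medium: str                        # "disk" or "tape"

@dataclass(frozen=True)
class BlobPolicy:
    disk_copies: int = 2
    tape_copies: int = 1
    min_metros: int = 3

def violations(policy: BlobPolicy, replicas: list) -> list:
    """Report how the current replica set falls short of the policy."""
    media = Counter(r.medium for r in replicas)
    metros = {r.metro for r in replicas}
    problems = []
    if media["disk"] < policy.disk_copies:
        problems.append("need more disk replicas")
    if media["tape"] < policy.tape_copies:
        problems.append("need more tape replicas")
    if len(metros) < policy.min_metros:
        problems.append("replicas span too few metro areas")
    return problems

current = [Replica("inst-a", "nyc", "disk"), Replica("inst-b", "nyc", "tape")]
print(violations(BlobPolicy(), current))
# ['need more disk replicas', 'replicas span too few metro areas']
```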

The combination of demand-based replication and background replication based on policy can provide fairly optimal storage at a much lower cost. Since the disclosed embodiments can add and remove replicas on a per-blob basis, and do so dynamically, users can specify a baseline policy for the least-needed blobs, and rely on real-time replication to add replicas for just those blobs that need additional copies. This can greatly reduce the overall cost of data storage.

An additional feature of the disclosed embodiments is content-based de-duplication. In the underlying storage system, if two blobs have identical contents, the data is stored only once. For example, consider the use of a blob store to store email attachments. If a person sends copies of the same attachment to multiple recipients, some embodiments of the present invention would only store a single copy of the attachment.

The disclosed embodiments are implemented on top of various data-center-scale storage systems such as BigTable and GFS (Google File System). That is, embodiments of the present invention utilize both BigTable storage and GFS storage as data stores for blobs.

Various features of the disclosed embodiments resolve problems created by prior art data storage systems. For example, keeping track of which piece of data is at which data center is very complex, especially when there is blob-level granularity. Without a dedicated system that manages the locations of individual blobs, most applications forego implementing individualized locations: such systems instead stick with a conceptually simpler system along the lines of “we have a complete data set X, and we have copies of the entire dataset at data centers A, B, and C.” The complete-dataset-only solution makes it easy to find a piece of data (it is at every data center), but creates other problems.

One problem with the complete-dataset-only solution occurs when a data center becomes unavailable. The software that accesses the data must be able to handle the outage and reroute user requests intelligently. This alone largely eliminates the perceived simplicity of having a complete data set at each data center, because the application software cannot rely on any individual data center.

A complete-dataset-only implementation also requires enough capacity at every data center to store the entire dataset. Not only is this expensive, it is also sometimes impossible to extend capacity at a particular data center (e.g., because one has run out of electrical capacity). This means that if the service needs more capacity, it needs to retire an existing data center, get capacity at a new data center, transfer all of the data (while simultaneously providing user access, because the service can't shut down), reconfigure the systems to recognize the new set of data centers, etc. Similar problems happen if a data center needs to have long-term maintenance or other unavailability. This is a major problem for distributed applications.

Another problem with a complete-dataset-only implementation is over-storage of little-needed blobs. Generally, the number of copies of the dataset has to be fixed by the number of copies needed for the most-needed blobs. Even if just a small number of blobs require a large number of copies, the same number of copies applies to all of the other blobs, creating large unnecessary overhead costs with little value.

Because of these factors, application developers artificially reduce the number of data centers at which they store data, and they will store data disproportionately at large data centers with high capacity. This causes underutilization of smaller data centers, and generally a less-than-optimal distribution of data.

Furthermore, even if application developers were to implement more flexible designs without the complete-dataset-only limit, there are inherent inefficiencies by not coordinating among various applications. For example, if multiple large applications implement distributed storage systems, the decisions about where to store data, when to transfer it, and so on, will be inefficient and may collide because each application is competing for the same scarce resources (disk space, network bandwidth, etc.) without coordination. Having a single unified storage system allows replication decisions to be centralized, which allows the most efficient possible allocation of resources.

Outline

A single deployment of a disclosed distributed storage system is called a “universe.” A universe comprises multiple instances, which are individual sub-nodes of a distributed storage system. Typically, there will be one instance per data center, but this is not required. Each instance has zero or more chunk stores. A chunk store is an underlying, typically data-center-scale, storage system, in which a blob can be written. Note that a “blob” (i.e., a binary large object) is a collection of binary data (e.g., images, videos, binary files, executable code, etc.) stored as a single entity in a database. This specification uses the terms “blob” and “object” interchangeably, and embodiments that refer to a “blob” may also be applied to “objects,” and vice versa. In general, the term “object” may refer to a “blob” or any other object such as a database object, a file, or the like, or a portion (or subset) of the aforementioned objects. Each blob at any point in time has replicas in one or more chunk stores around the world. Each instance also has a metadata table, which contains entries describing individual blobs: the contents of each blob, who is allowed to access the blobs, where the replicas of the blobs are located, and so on. Instances come in two types, known as local and global. The difference is that local instances store metadata only for blobs which have replicas in one of the chunk stores of the instance, while global instances store metadata for all blobs. There are generally only a few global instances in the universe.

Each blob is broken up into chunks, which are simply subsets of the contents of the blob. In some embodiments, each chunk holds a contiguous range of bytes from a blob. Blobs are broken into multiple chunks when a single blob is so large as to be unwieldy if manipulated as a single object. For example, failure in replicating a single large blob would be more likely to occur and more costly if it did occur (i.e., retransmitting the entire large blob again). If the same large blob were broken into many individual chunks, then no specific chunk would be likely to have a failure, and if one did fail, it would be inexpensive to retransmit the single chunk that failed. Each chunk is identified by a chunk ID. In some embodiments, the chunk ID is a mathematical function of the contents of the chunk. Embodiments that compute the chunk ID as a function of the contents have content-based de-duplication because the same content will always result in the same chunk ID. Note that content-based de-duplication of individual chunks results in de-duplication of blobs only if the splitting of blobs into chunks is performed in the same way for both blobs. In some embodiments, the splitting into chunks is deterministic (i.e., there is no randomness), so two identical blobs would have identical sets of chunks. One of the fields of the blob metadata is the extents table, which maps logical ranges of byte positions within each blob onto individual chunks. The actual chunk contents are stored in the chunk stores.
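The following sketch shows a content-derived chunk ID together with a small extents table, using SHA-1 as an arbitrary stand-in for the unspecified “mathematical function of the contents” and a deterministic fixed-size split of 4 bytes so the output is easy to inspect.

```python
import hashlib

def chunk_id(contents: bytes) -> str:
    # Content-derived ID: identical contents always yield the same ID,
    # which is what gives chunk-level de-duplication.
    return hashlib.sha1(contents).hexdigest()

def build_extents(blob: bytes, chunk_size: int = 4):
    """Deterministic fixed-size split: identical blobs yield identical
    chunk sets, the precondition for blob-level de-duplication."""
    extents = []                       # (start, end, chunk_id) triples
    for start in range(0, len(blob), chunk_size):
        piece = blob[start:start + chunk_size]
        extents.append((start, start + len(piece), chunk_id(piece)))
    return extents

for extent in build_extents(b"hello world!"):
    print(extent)
```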

A single instance includes the following components:

-   A metadata table, which is a database containing the metadata for each appropriate blob. In some embodiments, the metadata is saved in a BigTable.
-   A blobmaster, which is a program that acts as the external interface to the metadata table. A blobmaster provides functions such as “please return the metadata for blob X.”
-   Zero or more chunk stores, which are storage systems such as databases (e.g., BigTable), distributed file systems (e.g., GFS), or tape drive arrays. Inline chunk stores are a special case where the actual content is saved in the metadata table. Note that each chunk store belongs to a single instance. For example, even when there are two instances at the same data center, there are no shared chunk stores.
-   A bitpusher, which is a program which acts as the external interface to the chunk stores. A bitpusher provides functions such as “please return the contents of chunk X.”

The blobmaster and bitpusher “programs” (as well as most other programs identified herein) are meant in the sense of a distributed system. Each of these “programs” comprises one or more tasks, where a task is a single occurrence of the binary program executing on a particular machine. For example, the bitpusher at a single instance may actually be running on 100 different machines simultaneously, with each task running the same code. In some embodiments, each bitpusher task is responsible for a different subset of data. In addition, some embodiments assign tasks to virtual machines, and the mapping of virtual machines onto physical machines is done by a distributed computing environment. In these embodiments, portions of independent tasks may be running on the same physical machine at the same time.

In some embodiments, the partitioning of the blobmaster into tasks is done on a per-blob-ID basis. That is, at any given moment, there is a single blobmaster task responsible for each blob ID at that instance. This mapping of blob IDs to tasks, and the complicated handling of distributing load evenly, restarting failed blobmasters, etc., is handled in some embodiments by a BigTable coprocessor system. In general, the task scheduling system for blobmasters must coordinate closely with the database system that stores the metadata in order to guarantee that each blob ID is assigned to a unique blobmaster task. The task scheduling system must also coordinate closely with the network communication system used by clients to contact a blobmaster about a particular blob.

One special kind of chunk store is an inline chunk store, where the chunks are stored inside the metadata table along with the metadata for the blob. Inline chunk stores are normally handled by the same code paths as non-inline chunk stores, but data read operations from an inline chunk store are optimized specially. These stores are more expensive than other stores (e.g., because they don't provide content-based de-duplication—the chunks are stored with each blob that requires them) but are significantly faster to access.

Each instance may also include one or more auxiliary components:

-   A replication module comprises one or more servers that maintain a persistent queue of tasks to copy data from one instance to other instances. In some embodiments, the replication module maintains two or more independent queues to optimize processing. These replication queues are sometimes referred to as “repqueues.”
-   A tape master is an auxiliary server that helps the operation of tape-based chunk stores. In general, tape-based chunk storage uses two phases to read or write to tape, using an intermediate read/write buffer that may be managed by a tape master.
-   A quorum clock server is an auxiliary server that simply reports the current time according to that machine's internal clock. In some embodiments, each instance has multiple quorum clock servers to reduce the risk of problems associated with a failure or glitch in a single clock.
-   A statistics server is an auxiliary server that aggregates information from bitpushers and replication queues around the world about the current availability of capacity in chunk stores, network bandwidth, etc.
-   A “life of a blob” server is a debugging tool that allows developers and support technicians to examine the full history of a blob, including all operations that create, read, write, or replicate the blob, or chunks that comprise the blob. The full history also includes changes to the metadata for a blob, such as access rights.

The location assignment daemon, known as the “LAD,” is a system that makes decisions about background replication. The LAD always runs at a single instance, which must be a global instance.

Embodiments of the disclosed distributed storage system use several external systems for support. For example, a distributed storage system may use a configuration file distribution system, a load balancing service, and an authentication system. A configuration file distribution system pushes out updates to configuration files in a safe way to all of the servers at all of the instances. This enables configuration to be managed at a single central location, while usage of the configuration information is done locally at each instance. A load balancing service routes traffic to particular instances when there are choices among multiple instances. Embodiments of the distributed storage system report to the load balancing service how much traffic is currently flowing to each instance, and in return the load balancing service can answer questions of the form “I have a request originating here, which needs to talk to one of the following instances. Which one would be best to use?” The underlying network protocol includes an authentication system so that network calls into the distributed storage system can be reliably associated to the principals (i.e., users) making those calls.

Applications that wish to use embodiments of the disclosed distributed storage system use a client library, which is a code library that is embedded in application programs. The client library defines the outside API of the distributed storage system, providing operations such as “create a new blob with contents X” and “read the contents of this blob.” In its simplest mode, the client library provides an API similar to that of a file system. The client library also provides more advanced API routines that are specific to embodiments of the disclosed distributed storage system. For example, a client can access specific generations or specific representations of a blob (explained in more detail below). For example, the files used for a website (HTML pages, CSS files, JavaScript files, image files, etc.) may have multiple versions over time, and each of these versions could be saved as distinct generations.

Reading a Blob

One common operation responds to a request to “read the contents of blob X.” In a simple mode of operation, a blob is identified by a blob ID, which is similar to a file name. For example, the string “/blobstore/universename/directory/subdirectory/blobname” could be the blob ID of a blob when the individual components of the string are replaced by specific actual names. In some embodiments, the process works as follows:

-   (1) The application calls the “read a blob” API function in the client library.
-   (2) The client contacts a blobmaster. The client asks the load balancing service to give it any blobmaster, which is commonly the nearest blobmaster. The client asks the blobmaster for the metadata for the blob.
-   (3) The blobmaster looks up the metadata. In the simplest case, the desired blob is stored at the instance to which this blobmaster belongs. The blobmaster examines the metadata and verifies that, for example, the given user is authorized to view the contents of the blob. If the user is not authorized, the blobmaster returns an appropriate error message. If the user is authorized, the blobmaster returns:
    -   the metadata for the blob, which includes the mapping from byte ranges in the blob to chunk IDs;
    -   the list of chunk stores, which includes instance names, in which replicas can be found (not just the current instance); and
    -   either a set of read tokens or the chunk contents. In general, the blobmaster returns read tokens, which are cryptographically signed tokens saying that the blobmaster has authorized the given user to access the contents of particular chunks (e.g., one read token per chunk). However, in the special case that the blob is stored in an inline chunk store at the instance, the blobmaster returns the actual contents of the blob instead of read tokens.
-   (4) If the blob contains non-inline chunks, the client now contacts a bitpusher. In some embodiments, the client asks the load balancing service to give it any bitpusher belonging to an instance at which the blob has a replica. Because the previous load balancer call likely returned the closest blobmaster to the client, and the current scenario assumes there is a replica at that instance, the load balancing service will generally assign a bitpusher belonging to the same instance as the blobmaster that responded to the initial request. Although some embodiments will always assign a bitpusher from the same instance as the blobmaster in the current scenario, the more flexible assignments provided by a load balancer can better optimize the use of resources. The client sends the read tokens to the bitpusher, and the bitpusher returns the contents of the chunks.
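The four steps can be summarized in the following sketch. It compresses the distributed protocol into in-process objects: the class names and token format are invented, the authorization check is elided, and real read tokens are cryptographically signed rather than plain strings.

```python
class Blobmaster:
    """Metadata front end (steps 2-3): returns extents plus read tokens."""
    def __init__(self, metadata):
        self.metadata = metadata                     # blob_id -> metadata dict

    def get_metadata(self, blob_id, user):
        meta = self.metadata[blob_id]                # authorization check elided
        tokens = {cid: f"token:{user}:{cid}" for _, _, cid in meta["extents"]}
        return meta, tokens

class Bitpusher:
    """Chunk front end (step 4): serves chunk contents against read tokens."""
    def __init__(self, chunks):
        self.chunks = chunks                         # chunk_id -> bytes

    def read_chunk(self, chunk_id, token):
        assert token.endswith(chunk_id)              # stand-in for verification
        return self.chunks[chunk_id]

def read_blob(blob_id, user, blobmaster, bitpusher):
    meta, tokens = blobmaster.get_metadata(blob_id, user)
    parts = [bitpusher.read_chunk(cid, tokens[cid])
             for _, _, cid in meta["extents"]]
    return b"".join(parts)                           # reassemble byte ranges

bm = Blobmaster({"blob-1": {"extents": [(0, 6, "c1"), (6, 11, "c2")]}})
bp = Bitpusher({"c1": b"hello ", "c2": b"world"})
print(read_blob("blob-1", "alice", bm, bp))          # b'hello world'
```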

The process of reading a blob is more complex if the blob is not present at the instance that the client originally contacted. In some embodiments, the original blobmaster contacted may reside at a global instance, which holds all of the metadata for all of the blobs. In other embodiments, clients can contact only local blobmasters, and local blobmasters will contact global blobmasters when necessary. In some embodiments, connections from a local blobmaster to a global blobmaster use a load balancing service to select an appropriate global blobmaster. In other embodiments, the small number of global blobmasters are geographically dispersed, so each local blobmaster contacts a specific global blobmaster when necessary to find a blob. In the subsequent discussion, “initial blobmaster” and “initial instance” refer to the blobmaster and the instance originally contacted, which may be global instances.

When a desired blob is not stored at the initial instance, the blob metadata is retrieved from a global instance. The global instance may be the initial instance; otherwise, the local blobmaster at the initial instance may query a global blobmaster. As noted above, contacting a global blobmaster generally uses a load balancing call. The global blobmaster first determines if the desired blob exists and whether the user has rights to access it. If the requested blob does not exist, or the user does not have access privileges, the global blobmaster returns an appropriate error message. If the blob does exist, and the user has access rights, then the global blobmaster examines the set of locations at which the blob is currently stored to develop a delivery strategy. If there is a replica of the blob “close” to the client, then the strategy is generally to return the blob metadata to the client (either directly, or indirectly via the initial blobmaster), and everything proceeds as before. In this case, the client will access the blob at the identified close replica.

If the nearest replica of the blob is “far” from the client, the global blobmaster may instead choose to trigger real-time replication to copy the blob from a distant replica to an instance closer to the client. Real-time replication begins by picking a replica of the blob to act as the “source replica,” and a chunk store belonging to the initial instance (which is typically a local instance close to the client) to act as the destination chunk store. The initial instance triggers real-time replication.

Part of the real-time replication process is to change the metadata of the blob to indicate that there is now a new replica at this initial instance. The replication is flagged as being “real-time” and therefore gets the highest priority for the use of network links, etc. Of course this means that real-time replications are expensive operations. Much of the logic of background replication, described below, is designed to minimize the use of real-time replication. Another part of real-time replication is the actual replication of the blob contents. In some embodiments, the replication module at the source instance creates a queue entry for each chunk in the blob to replicate, and proceeds to replicate the chunks. Because real-time replication has the highest priority, the replication of these chunks typically occurs right away.

Once a dynamic replication starts, the process continues to completion regardless of the original request. That is, even if the original user request for the blob is rescinded, the replication does not stop. Some embodiments of the disclosed distributed storage system do not leave blobs in inconsistent or incomplete states.

The initial blobmaster returns the new metadata to the client, and the read process continues as described above. Assuming that the client does, indeed, read from this instance (which is generally true), the bitpusher at this instance will both write the data locally to the designated chunk store (to create the new replica) and forward the chunks to the client. Both of these operations occur as bytes arrive at the initial instance from the source copy. Conceptually, the client is reading from the remote instance, but simultaneously a local copy is being saved. The idea is that (a) because the distributed storage system has already paid the really expensive cost—the cost of copying data over a long-haul link—the system may as well create an additional copy locally; and (b) if someone has accessed the blob now, it's likely that someone may access it again soon, so having a local copy will be helpful.

Note that the new replica created by real-time replication is identical in every way to any other replica of the blob. The new copy is not a special, transient replica, and is not subject to more restricted access. This new replica is identified in the metadata for the blob, so once it is copied, any user with appropriate access privileges may access this new copy.

In some embodiments, the full set of rules for deciding whether or not to invoke real-time replication can be more complicated than the distance between the client and the source replica. In some embodiments, real-time replication rules may be specified as part of a blob's replication policy. Some exemplary factors that may be considered are listed below; a simplified sketch of such a decision function follows the list:

-   distance from the client to the various replica locations;
-   the current status of various network links, storage systems, and so on, which enable forming an accurate estimate of the actual cost of accessing the various replicas;
-   whether the user or owner of the blob has specified a policy that deliberately prohibits or discourages real-time replication. For example, the blob user or owner may know a priori (either at policy-writing time or at the time of the individual request) that this request is not likely to be repeated again, so the cost of creating a new replica would be wasted; and/or
-   whether the policy imposes a “hard constraint.” For example, a blob should never be stored in the E.U. for legal reasons.
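The decision function below incorporates these factors; the cost model, the 2000 km distance cutoff, and the cost threshold are invented for illustration, since the embodiments specify only which inputs may be considered.

```python
def should_replicate(distance_km: float,
                     estimated_access_cost: float,
                     policy_discourages: bool,
                     hard_constraint_violated: bool,
                     cost_threshold: float = 100.0) -> bool:
    """Decide whether a read should trigger real-time replication."""
    if hard_constraint_violated:        # e.g., "never store in the E.U."
        return False
    if policy_discourages:              # owner knows the request won't repeat
        return False
    # A distant source plus an expensive access path argues for a new replica.
    return distance_km > 2000.0 and estimated_access_cost > cost_threshold

print(should_replicate(8000.0, 250.0, False, False))  # True: make a local copy
print(should_replicate(500.0, 250.0, False, False))   # False: source is close
```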

Some embodiments of the present invention provide more advanced forms of “reading a blob.” In some embodiments, the general blob reading API is a class that provides the following functionality: (a) start a new blob reader that fetches the metadata for the blob; (b) read any particular subrange of byte positions within a blob; or (c) return summary statistics derived from metadata, such as the total size of a blob in bytes.

In some embodiments, the API provided by the client library implements ordinary POSIX file semantics, including “open,” “pread,” etc.

Some embodiments improve performance by having each bitpusher task maintain an in-memory cache of chunks that the bitpusher has recently processed. If there are multiple tasks at a particular instance, then chunk IDs are preferentially assigned to a particular task by a mathematical function of the chunk ID. This means, for instance, that client read requests for a particular chunk will attempt to contact the bitpusher task that is more likely to have cached the same chunk ID previously. This cache locality improves cache usage. The client will contact another task only if the preferred task is overloaded or unavailable.
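The affinity rule can be sketched as follows, with a truncated MD5 digest standing in for the unspecified “mathematical function of the chunk ID” and a simple next-task fallback for the overloaded case.

```python
import hashlib

def preferred_task(chunk_id: str, num_tasks: int) -> int:
    """Deterministically map a chunk ID to its preferred bitpusher task."""
    digest = hashlib.md5(chunk_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_tasks

def pick_task(chunk_id: str, num_tasks: int, overloaded: set) -> int:
    task = preferred_task(chunk_id, num_tasks)
    if task in overloaded:               # fall back only when necessary
        task = (task + 1) % num_tasks
    return task

print(pick_task("chunk-abc", 100, set()))                        # preferred task
print(pick_task("chunk-abc", 100, {preferred_task("chunk-abc", 100)}))
```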

Writing a Blob

Embodiments of the disclosed distributed storage system are primarily designed for immutable or weakly mutable data, so these embodiments generally provide a more restricted set of API functions for file content manipulation than most file systems. Specifically, some embodiments allow a user to create a blob, completely overwrite a blob's contents, or delete the blob, but not partially modify the internal contents of a blob. This is not a fundamental limitation, because any partial modification of a blob's contents could be accomplished by deleting the old version and creating a new blob with the desired modifications. Other embodiments do not impose these limitations, but may internally implement changes as a delete plus a create. In terms of POSIX file semantics, the embodiments that impose these limits support the modes “r” and “w,” but not, for example, “r+.”

The simplest form of writing a blob creates a new blob. The process of overwriting an existing blob is described below. The description here illustrates the operations performed to write a blob in embodiments of a distributed storage system, but is not intended to be limiting. One of ordinary skill in the art would recognize that many variations of the disclosed operations are possible within the scope of the disclosed teachings.

A user application begins writing a blob by instantiating a “blob writer” object. The blob writer object is capable of creating (or really, overwriting) a single blob. The application repeatedly calls a write function, passing data to the blob writer. In some embodiments, the write function permits the user application to specify that “the following data should start at offset X within the blob.” This is syntactically analogous to POSIX pwrite( ). Higher-level API functions within the blob writer object expose behaviors analogous to POSIX write( ), etc. Note that it is an error to write to a data range of a blob that has already been written.

In some embodiments, the client buffers writes, so that the client can decide on the most natural partitioning of the written data into chunks. In some embodiments, the partitioning both optimizes content-based de-duplication and keeps the number of chunks small. Typically, having a smaller number of chunks makes the underlying storage more efficient. In some embodiments, the partitioning divides each blob into chunks of a fixed size. Some embodiments use Rabin-Karp chunking or other complex algorithms.
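In the spirit of the Rabin-Karp chunking mentioned above, the sketch below places a chunk boundary wherever a rolling hash over a sliding window matches a bit mask, so identical content tends to produce identical chunks even after insertions elsewhere in the blob. The polynomial hash, window size, and mask width are simplified stand-ins for a true Rabin fingerprint.

```python
import os

WINDOW, BASE, MOD = 16, 257, (1 << 31) - 1
MASK = (1 << 11) - 1                     # ~2 KiB average chunk size

def chunk_boundaries(data: bytes) -> list:
    """Cut wherever the rolling hash of the last WINDOW bytes hits MASK."""
    boundaries, h = [], 0
    power = pow(BASE, WINDOW - 1, MOD)   # factor for the outgoing byte
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * power) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                    # add the newest byte
        if i >= WINDOW and (h & MASK) == MASK:
            boundaries.append(i + 1)                   # boundary after byte i
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))
    return boundaries

data = os.urandom(100_000)
cuts = chunk_boundaries(data)
sizes = [b - a for a, b in zip([0] + cuts, cuts)]
print(len(sizes), sum(sizes) // len(sizes))            # chunk count, mean size
```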

The client decides which type of chunk store should be used to write the data. The selection is based on the data being written as well as the blob policy the user selects for the blob. Some policies are very explicit about the type of data store. For example, “always write these blobs to inline-in-memory chunk stores” would be an appropriate policy for a blob that needs to be accessed very quickly. Other policies provide a range of options based on blob characteristics. For example, some embodiments include a “standard disk” policy that writes to different stores depending on the size of the blob: blobs whose total size is less than one threshold are saved to an inline store; blobs between the first threshold and a second threshold are saved to a BigTable-based store; and very large blobs with size greater than the second threshold are saved as chunks in a distributed file system store. This allocation based on size works in some embodiments because different chunk stores can handle different sizes better. For small blobs, the overhead cost of storing to inline chunks is low and the efficiency gain is high; a BigTable-based store is generally efficient but may have trouble handling very large data; and the distributed file system store (using GFS, for example) is very good at handling large data, but has a high overhead per datum and thus is inappropriate for small data.
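A minimal version of this size-tiered selection appears below. The two thresholds (64 KiB and 8 MiB) are invented for the example; the text specifies only the ordering of the tiers, not their boundaries.

```python
INLINE_MAX = 64 * 1024          # assumed first threshold
BIGTABLE_MAX = 8 * 1024 * 1024  # assumed second threshold

def pick_chunk_store(blob_size: int) -> str:
    if blob_size < INLINE_MAX:
        return "inline"          # low overhead, fastest access
    if blob_size < BIGTABLE_MAX:
        return "bigtable"        # efficient for mid-sized data
    return "distributed-fs"      # built for very large data

for size in (1_000, 1_000_000, 100_000_000):
    print(size, "->", pick_chunk_store(size))
```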

When the cache is full, or when the application explicitly calls a Flush( ) method on the client, the client actually writes the data to a data store. The actual write to a data store is accomplished by contacting a bitpusher (selected by load balancing) and writing data. In general, the bitpusher is near the client. The bitpusher verifies that this user is allowed to write, and then actually writes the chunk. In embodiments that implement content-based de-duplication, the chunk is not written if the chunk is already present. The bitpusher returns to the client a write token for each chunk. In some embodiments, a write token is a cryptographically signed token indicating that a certain chunk was written to a specific chunk store, as part of a certain blob, etc. Inline chunks are written through this code path as well, but do not perform content-based de-duplication.

Either at the end of blob writing, or when the application explicitly calls a FlushMetadata( ) method on the client, the client writes the metadata for this blob to a metadata store. The client contacts a blobmaster (selected by load balancing or based on the instance(s) where chunk data has been written) and tells the blobmaster that it is writing to a particular blob ID. The client passes various information to the blobmaster: all of the write tokens that it has received; structural and access control information about the blob such as its extents table; and the relevant blob policy. As soon as this data is written to a local instance, read operations that arrive at this local instance will be possible. In addition, this change to metadata will be propagated to other relevant instances as soon as it is written. The changes to the metadata are replicated by the metadata replication system. Metadata replication is discussed below, and in more detail in co-pending U.S. patent application Ser. No. 12/703,167, “Method and System for Efficiently Replicating Data in Non-Relational Databases,” filed Feb. 9, 2010, which is incorporated herein by reference in its entirety.

In some embodiments, the client calls a Finalize( ) function on the blob writer when it is done writing a blob. The call to Finalize( ) will also occur automatically if the blob writer object is deleted before the Finalize( ) method is called. The process of finalizing performs several important operations. First, finalizing flushes the client's data buffer, to guarantee that all of the blob contents are physically written to a data store. Second, as part of finalizing, the client decides where the initial location of the blob should be. In the common case where all chunks were written to the same chunk store, the location is that chunk store (and the instance where that chunk store is located). If chunks were spread over multiple chunk stores, the client typically picks the chunk store that received the majority of the bytes, or the greatest number of bytes. Because chunks are not necessarily the same size, having the majority of bytes is not necessarily the same as having the greatest number of chunks. If a blob is large and the bitpushers were highly loaded during the write process, the chunks may be distributed across multiple targets. Similarly, if the upload took a long time, and during that time a particular instance became temporarily unavailable, the writes would have gone to an alternative bitpusher at a different instance. As these examples illustrate, in the process of writing a blob, individual chunks may be written to different chunk stores within one instance, or different chunk stores at different instances.
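The byte-weighted location choice can be illustrated as follows. Note in the example that one store receives the most bytes while another receives the most chunks; the function name and figures are hypothetical.

```python
from collections import defaultdict

def pick_initial_location(chunk_writes):
    """chunk_writes: iterable of (chunk_store, bytes_written) pairs."""
    totals = defaultdict(int)
    for store, nbytes in chunk_writes:
        totals[store] += nbytes
    return max(totals, key=totals.get)   # most bytes, not most chunks

writes = [("store-a", 6_000_000),        # one large chunk
          ("store-c", 1_000_000), ("store-c", 1_500_000),
          ("store-c", 2_000_000)]        # three smaller chunks
print(pick_initial_location(writes))     # store-a wins on bytes
```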

As part of the finalizing process, the client flushes its metadata. Along with this metadata update flush, the client sends a command to “finalize the blob at instance Z.” When a metadata update to “finalize” is received by a blobmaster, several things happen, including determining whether all of the chunks of the blob are already present at the chosen destination location. (This is the common case!) If so, the blobmaster immediately marks the blob as finalized. At this point, future modifications to the contents of this blob are forbidden. In some embodiments, all of the chunks must be saved to the same chunk store at the destination instance in order to immediately mark the blob as finalized. In these embodiments, all of the chunks must be consolidated into a single chunk store prior to designating the blob as “finalized.”

If the blob cannot be immediately finalized, it is instead marked as “finalizing.” Future modifications are immediately forbidden, and the blobmaster at the destination instance triggers chunk replication operations to copy chunks from wherever they may be to the chosen destination. In particular, when the metadata update that triggers finalization arrives, either directly from the client or via the metadata replication system, at the blobmaster for the instance responsible for the chosen destination chunk store, the blobmaster at that instance will trigger the copies. Other blobmasters will note that the blob is finalizing, but not trigger any copies. As chunks are replicated successfully, the replication module writes further metadata updates for the blob, indicating that chunks are present. As each chunk is received, the blobmaster determines if all of the chunks identified in the metadata are present. When all of the chunks are finally at the designated instance, the blobmaster marks the blob as finalized.

Regardless of whether a blob could be finalized immediately, or required replication of one or more chunks, the blobmaster makes a call to the background replication system as soon as the blob is finalized. This is explained in more detail below in the section on “background replication.”

Overwriting a Blob

Overwriting blobs is closely related to blob generations. Blobs stored in embodiments of the disclosed distributed storage system comprise one or more generations. A generation is effectively a version of the blob contents. Each time the blob is overwritten, the old generation continues to exist, but a new generation is created. Each generation of a blob has a generation ID. In some embodiments, the generation ID is a 64-bit integer. In some embodiments, the default generation ID is the timestamp at which the generation was created, with the least-significant bits containing some tiebreakers to resolve ambiguities if multiple servers try to write data within the same microsecond. In some embodiments, clients are permitted to override the default ID with any selected unique value (the client may not use the same generation ID for two distinct generations of the same blob).
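
As an illustration of the default generation ID scheme described above, the following Python sketch packs a microsecond timestamp into the high bits and a tiebreaker into the least-significant bits; the 12-bit split and the use of the process ID as the tiebreaker are assumptions, since the precise layout is not specified here:

    import os
    import time

    TIEBREAKER_BITS = 12  # assumed split; the actual allocation may differ

    def default_generation_id():
        """64-bit generation ID: microsecond timestamp in the high bits,
        a per-server tiebreaker in the least-significant bits."""
        micros = int(time.time() * 1_000_000)
        tiebreaker = os.getpid() % (1 << TIEBREAKER_BITS)
        return (micros << TIEBREAKER_BITS) | tiebreaker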

The description of writing a blob above actually applies to a single generation: it is an individual generation of a blob that may be uploading, finalizing, or finalized; and an individual generation has replicas at various locations, etc. Read operations most commonly ask for the most recent generation, and thus the generation returned may depend on which instance is queried at the start of the read operation. Due to latency, different instances may know about different subsets of generations, and thus the “most recent” generations at different instances may be different. (This scenario exemplifies the “eventual consistency” of weak mutability addressed above.) Read operations may also ask to see a specific generation, or even to see the metadata for all generations. In some embodiments, a write operation invariably creates a new generation. When a blob writer object is first created, its arguments include the blob ID to be written, and optionally the generation number that should be assigned. As noted above, some embodiments automatically use a timestamp as a default generation ID.

In some embodiments, there is a location assignment daemon (LAD) that coordinates the planet-scale behavior of blob generations over the long term. The LAD may relocate individual generations of a blob to different instances, or delete specific generations according to a blob's policy. For example, a typical policy specifies keeping N generations of a blob (N can equal 1), so the LAD may delete all generations beyond the first N. The LAD comprises multiple processors running in parallel so that the entire set of blobs can be reviewed in 4 to 8 hours.

The term “generation” is appropriate because two generations of a blob are related but different from one another, and there are only a certain number of generations alive at any time.

In addition to generations, embodiments of the present invention include several other advanced metadata concepts, including references and representations. References act like hard links in a file system. Each blob has one or more references, and when the last reference to a blob is deleted, the blob itself is deleted. In general, a blob is initially created with a single reference. In some embodiments, each reference has its own access control lists, policies, and so forth. In some embodiments, one of the references to a blob is designated as the default reference, which is generally the original reference from when the blob was created. A read request is actually a request to read a particular reference to a blob. If no reference ID is specified, the default reference is assumed. In some embodiments, reference IDs are strings, which may be fixed length or variable length. In other embodiments, a reference ID is an integer, such as a 32-bit or 64-bit integer. In some embodiments, the reference ID is part of the blob ID, so a blob ID may have the form /blobstore/universe/directory/subdirectory/blobname:referencepath/morepath/referencename. In some embodiments, the default reference ID is the empty string. The use of the empty string as the default reference enables simplified blob IDs.

It is useful to note that “references” and “generations” are distinct, independent attributes of a blob. Each reference refers to the whole blob, which includes all generations of the blob. As new generations are created, the same references apply to the new generations. In addition, as new references are created, the references apply to all of the generations. References and generations are effectively orthogonal attributes.

In some embodiments, references are deleted by issuing a metadata change that marks a particular reference with a “tombstone” time. A tombstone time is a timestamp that specifies when the physical reference will actually be deleted. For example, the tombstone time may be 30 days after being marked for deletion. References with tombstones are normally considered to be “deleted” for the purpose of ordinary reads, but references with tombstones can still be accessed and undeleted by certain “superusers.” The existence of superusers provides a safety mechanism against accidental deletion. Once the tombstone time has passed, the reference is actually removed, and if this is the last reference to the blob, the entire blob is deleted. This is described in more detail below with respect to “tombstone expiration.”
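
A minimal sketch of the tombstone mechanics, assuming references are modeled as dictionaries and a fixed 30-day grace period (both illustrative choices):

    import time

    TOMBSTONE_GRACE_SECONDS = 30 * 24 * 3600  # e.g., 30 days

    def mark_reference_deleted(reference, now=None):
        """Attach a tombstone time instead of deleting immediately."""
        now = time.time() if now is None else now
        reference["tombstone_time"] = now + TOMBSTONE_GRACE_SECONDS

    def is_visible(reference, superuser=False):
        """Ordinary reads treat tombstoned references as deleted;
        superusers can still see (and undelete) them until the
        tombstone time passes and the reference is removed."""
        return superuser or "tombstone_time" not in reference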

Another important concept for blobs is “representations.” Conceptually, representations identify distinct ways to view or format the same piece of information. For example, a digital photograph may have one representation that is a full-size high-resolution image, and a second representation that is a low-resolution thumbnail image. In some ways, representations are like different language translations of the same book. In some embodiments, representations are managed by coprocessors, which operate in parallel with the functionality described above. Note that “coprocessors” here do not inherently refer to CPU or hardware coprocessors, although in some embodiments the coprocessor functionality is fully or partially implemented in CPU/hardware coprocessors. In the blob hierarchy, each blob has one or more generations, and each generation has one or more representations. In general, each blob generation has only a single representation.

To summarize, the overall metadata structure for blobs comprises three components:

-   one base metadata entry per generation. This generation entry contains data for each representation. The representation entries describe the contents of the blob, including the extents table that identifies the chunks, and offsets to each chunk.
-   one reference metadata entry per reference. This reference entry contains access control lists (e.g., who has access), policies, etc.
-   any inline data saved for the blob, with one entry per chunk. Each inline entry is associated with a unique generation and a unique representation. In some embodiments, there is no re-use of inline chunks.
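
One way to picture this three-part structure is the following Python sketch; the class and field names are illustrative assumptions, not the actual schema:

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class Representation:
        # Extents table: (chunk_id, offset) pairs describing the contents.
        extents: List[Tuple[str, int]] = field(default_factory=list)

    @dataclass
    class Generation:
        representations: Dict[str, Representation] = field(default_factory=dict)

    @dataclass
    class Reference:
        acl: List[str] = field(default_factory=list)  # who has access
        policy: str = ""                              # replication policy

    @dataclass
    class BlobMetadata:
        generations: Dict[int, Generation] = field(default_factory=dict)
        references: Dict[str, Reference] = field(default_factory=dict)
        # Inline chunks, keyed by (generation ID, representation ID, chunk index).
        inline: Dict[Tuple[int, str, int], bytes] = field(default_factory=dict)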

Replication

Embodiments of the disclosed distributed storage system store both blobs and metadata for the blobs. The blobs may be very large, and comprise uninterpreted binary data, whereas the metadata for each blob is small, and comprises a well-defined set of attributes. Moreover, blobs exist because end users access the contents of the blobs (directly or indirectly), whereas metadata for blobs exists to facilitate access to blob contents. For these and other reasons, replication of metadata uses a different mechanism than replication of blobs. Both forms of replication are described below. In general, “replication” will refer to blob replication unless the context clearly indicates otherwise. In both forms of replication, there is a source instance that provides the data to be copied, and a destination instance, which is the target for the copy. For blob replication, one or more destination chunk stores must be selected to store the chunks that are copied.

Blob replication can be triggered in multiple ways. It is invoked implicitly by both real-time replication and background replication. In some embodiments, replication can also be called directly by a function in the client library API.

In some embodiments, blob replication begins with a call to the ReplicateBlob( ) function at a blobmaster. In some embodiments, the function call to replicate occurs at the blobmaster for the destination instance. That is, the call is made to a blobmaster at the instance responsible for the destination chunk stores. In alternative embodiments, the call to begin replication occurs at the instance that will act as the source for a copy of the blob. In some embodiments, the arguments to the ReplicateBlob( ) function include the blob ID, the source instance, and the priority for the copy. In embodiments where ReplicateBlob( ) calls always occur at the destination instance, the function call need not specify the destination instance (it is implied). In some embodiments, the destination instance is included as an argument (or an optional argument) in order to provide greater flexibility about which blobmaster to call. The priority is assigned based on the type of request. For example, real-time replication has a high priority and is allowed to use a high network priority as well, because it is generally in response to a real-time request from an end user. Background replication tasks have varying priority determined by the LAD, but virtually always use a low network priority because they are not time-sensitive.
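
A hedged sketch of what such a call might look like, based only on the arguments named above (the parameter names and the request format are hypothetical):

    def replicate_blob(blob_id, source_instance, priority,
                       destination_instance=None):
        """Build a replication request. When the call is always made at
        the destination blobmaster, the destination is implied and may
        be omitted; passing it explicitly allows the caller to choose
        which blobmaster to contact."""
        return {
            "blob_id": blob_id,
            "source": source_instance,
            "destination": destination_instance or "local-instance",
            "priority": priority,  # high for real-time, low for background
        }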

In some embodiments, the destination blobmaster (or the blobmaster that received the call to replicate) contacts the source blobmaster and asks it to initiate the source operation. This initiation at the source is sometimes referred to as “metadata pinning.” The term “pinning” is used to indicate that during the replication process, the source copy of the blob is not allowed to be removed. While a blob is being copied from one instance to another, there is essentially a single copy of the blob that “spans” two instances. Once the replication is complete, there are two independent replicas, which are individually subject to deletion and removal. At the completion of the copy operation, the metadata for the blob is updated again to indicate that the copy operation is complete.

In some embodiments, a blobmaster at the source instance prepares for replicating chunks of a blob by making an immediate change to the blob metadata at the source instance. The change indicates that there is now a new replica of the blob at the destination instance. The state of the new replica indicates that replication is in progress from the source instance. This is sometimes referred to as being “in-flight.” The source instance writes this change to its own metadata table. This change to the metadata at the source is important for several reasons, such as preventing removal of the source copy before the copy operation is complete. In particular, the background processing of the LAD could determine that the copy of the blob at the source is no longer needed; the change to the metadata indicates that the replica of the blob at the source instance is in use by a pending copy, and therefore this replica may not be removed.

The source instance transmits the entire metadata for the blob to the destination instance. In some embodiments, the metadata is copied as-is. In alternative embodiments, the metadata for the blob is converted to a sequence of one or more metadata mutation operations, as used in typical metadata replication. The mutations (also known as deltas) are then sent to the destination instance. The use of deltas to transmit the metadata facilitates general metadata replication because there are no collisions between the different replication methodologies. The use of deltas also facilitates compaction, which is described in more detail below.

After the destination instance receives the metadata for the blob, the destination blobmaster initiates chunk replication by informing its local replication manager. The destination replication manager sends a “replicate chunks” command, which specifies the chunks to be copied, the source and destination for the chunks, and sometimes various auxiliary information such as priorities. In some embodiments, the command to replicate chunks specifies the chunk stores where the chunks are currently stored at the source instance as well as the chunk stores that will store the chunks at the destination instance. In some embodiments, the destination chunk stores are determined by the blob policy, and are thus not included in the replicate command. In some embodiments, identifying the specific chunk stores is optional, with storage determined based on policy if the chunk stores are not specified.

The replication manager either executes the replicate commands immediately, or places the commands in a replication queue. In some embodiments, the replication queues are “stable.” That is, once the replication manager acknowledges to the blobmaster that a command has been queued, the replication manager promises to execute the command, even if the replication manager or the queues managed by the replication manager fail before completing all of the commands. For example, some embodiments save the replication queues in persistent storage.

The replication manager maintains a priority queue of logical copy operations. Each queue entry specifies the chunks to be copied, the source instance, the destination instance, the network quality of service, the requesting user, and the priority. The priority is passed to the replication manager as part of the replication request. Sometimes the copy operations are referred to as “links” because multiple links may be used to copy chunks from an original source to the final destination. In some embodiments, each queue entry corresponds to exactly one chunk; in other embodiments, a single queue entry may specify a list of chunks. In some embodiments, the replication manager detects the presence of duplicate requests, which would include requests to send the same chunk from the same source to the same destination. In some embodiments, entries are considered duplicates only if they have the same network quality of service, requesting user, and priority as well. In embodiments that detect duplicates, one of the duplicates is selected to process, which may be the one with the higher quality of service, the one with the higher priority, or the one that was inserted into the replication queue earlier. In these embodiments, the duplicates that are not selected may be deleted, or placed into a holding state until the chunks are copied based on the selected queue entry. The network quality of service (QoS) may determine the speed of transfer, and can be used to determine which processes are abandoned when a network communication link becomes overloaded. The quality of service can be specified by the end user or the client library. Of course, a higher quality of service costs more, so the requester must determine whether the benefit of higher quality is worth the additional cost.
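
The following Python sketch models the priority queue with duplicate detection, keeping the higher-priority entry when requests name the same chunk, source, and destination; the class shape and the policy of preferring higher priority are illustrative assumptions:

    import heapq
    import itertools

    class ReplicationQueue:
        def __init__(self):
            self._heap = []
            self._best = {}  # (chunk, src, dst) -> best priority seen
            self._order = itertools.count()  # tiebreaker for equal priorities

        def enqueue(self, chunk, src, dst, priority):
            key = (chunk, src, dst)
            if key in self._best and self._best[key] >= priority:
                return False  # duplicate with no better priority; drop it
            self._best[key] = priority
            heapq.heappush(self._heap, (-priority, next(self._order), key))
            return True

        def pop(self):
            """Return the highest-priority pending copy, skipping entries
            superseded by a higher-priority duplicate."""
            while self._heap:
                neg_priority, _, key = heapq.heappop(self._heap)
                if self._best.get(key) == -neg_priority:
                    del self._best[key]
                    return key, -neg_priority
            return None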

The replication manager executes a replicate command by asking the local bitpusher to pull the data from a remote bitpusher, and to write the data locally as soon as it arrives. When the replication manager has finished a copy operation, a metadata change is written to the blobmaster, which indicates that the new chunk is present.

In an alternative embodiment, if both the source and destination of the replication are inline chunk stores, then the data is copied as part of the metadata replication.

Each blobmaster periodically examines its metadata, and determines the effects of recent metadata changes. In some embodiments, this periodic examination of metadata accompanies compaction analysis, because both review metadata changes. During this examination, if the blobmaster determines that all chunks of a replication have arrived, the blobmaster modifies the information for the replica to remove the annotation that a copy is in progress. This allows the source replica to be removed later, if the LAD or another process decides that this replica is no longer needed.

Metadata Replication and Compaction

In some embodiments, blobmasters perform two special related tasks with metadata: blobmasters replicate changes in the metadata to other instances and compact the changes at their own instance. Replication propagates changes to the metadata to every other instance that needs to track the changes. In general, the changes must be propagated to every global instance and to each local instance that has a copy of the blob whose metadata has changed. Because the changes are stored and replicated as deltas, some embodiments periodically compact the changes to provide faster access to the data and reduce the storage space usage. The compaction process merges information about changes into the underlying base data. The operations of replication and compaction are interrelated in some important ways.

To understand metadata replication and compaction, it is useful to know how metadata is stored. For each blob, the metadata table contains both the current “merged” state of the metadata, and a sequence of zero or more metadata delta records. Each metadata update is implemented by writing a new delta record, which efficiently captures just the changes. The updates are done as “blind writes,” without database locks and without a read-modify-write cycle.

One attribute of a metadata delta is a sequence identifier. In some embodiments, sequence identifiers are globally unique, which provides a well-defined unique ordering of the metadata deltas. In some embodiments, sequence identifiers are fixed-length binary strings, but other embodiments use a variable-length string, a 64-bit integer, or another appropriate data type. A sequence identifier is also referred to as a “sequencer,” because it specifies where each delta falls in the global ordering of deltas.

In some embodiments, a sequence identifier comprises a timestamp and a tie breaker. The timestamp indicates when the delta was created. In some embodiments, the timestamp is the number of microseconds since the beginning of the current epoch or another well-defined point in time. In some embodiments, the timestamp is assigned by the blobmaster that received the metadata update. Generally, the timestamp is assigned at the moment the update is received. In some embodiments, one or more special clocks are used to assign these timestamps. Some embodiments use a “stable clock system” as described below.

A tie breaker uniquely identifies the blobmaster that issued the timestamp. As noted above, the blobmaster functionality at an instance may be performed by many different blobmaster tasks, each of which may assign tie breaker values to the sequence identifiers that it generates. Therefore, some embodiments compute a tie breaker value as a mathematical function of both the physical machine on which the blobmaster task is running, and the UNIX process ID assigned to the task. In some embodiments, the tie breaker value is computed as a function of additional values, such as the instance identifier of the instance where the blobmaster task is running. By combining both the timestamp and a tie breaker to form a sequence identifier, when a single blobmaster task issues two successive sequence identifiers, the second one will be strictly greater than the first one. Also, because of the tie breaker values included in sequence identifiers, the sequence identifiers are globally unique. In particular, if a single blobmaster task is restarted, or if two blobmasters act on the same blob at different instances, they are guaranteed to generate different sequence identifiers.
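
A minimal sketch of such a sequencer, assuming a hash of the host name and process ID serves as the tie breaker (the actual function is not specified here):

    import hashlib
    import os
    import socket
    import time

    def tie_breaker():
        """Value derived from the physical machine and the UNIX process
        ID, so distinct blobmaster tasks get distinct values."""
        raw = f"{socket.gethostname()}:{os.getpid()}".encode()
        return int.from_bytes(hashlib.sha256(raw).digest()[:4], "big")

    def sequence_identifier():
        """(timestamp in microseconds, tie breaker); tuple comparison
        yields the global ordering of deltas."""
        return (int(time.time() * 1_000_000), tie_breaker())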

Sequence identifiers constructed with timestamps and tie breakers have several useful characteristics. Because of the timestamp portion of sequence identifiers, the sequence identifiers are at least approximately in the natural order, because the system clocks on the various computers maintain roughly the same time. That is, sequence identifiers create a stable, well-defined sort order for deltas. Because of this, the order of operations is defined to be the order created by the sequence identifiers, regardless of the “actual” order in the real world. To guarantee the approximate natural order of metadata deltas, some embodiments include programs, processes, or policies to prevent excessive divergence of the time clocks throughout the distributed storage system.

In some embodiments, each delta specifies the instance where the delta was created, that is, the instance of the blobmaster that initially received the delta. This is the instance that is responsible for replicating the delta to all other relevant instances. The combination of the sequence identifier and the instance of origin for a delta is sometimes referred to as the provenance of the delta.

A metadata merger program is used to read metadata so that the most current metadata is returned to each requestor. The metadata merger program starts with the “merged” base metadata. The metadata merger program then applies each of the associated zero or more deltas, in order, to the base metadata, to produce the final merged metadata. In this way, metadata reads always get the most current information that is available at the instance, even if new deltas have been inserted. Whenever metadata is read by a blobmaster, the blobmaster reads both the merged base metadata and all of the associated deltas, passes them through this merger process, and returns the final result to the caller.
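
The merge itself can be stated in a few lines. In this sketch, a delta is simplified to a (sequencer, field, value) triple that overwrites one field; real deltas are richer:

    def read_metadata(base, deltas):
        """Apply the deltas to the merged base metadata, in sequence
        identifier order, and return the final merged metadata."""
        merged = dict(base)
        for _sequencer, field, value in sorted(deltas, key=lambda d: d[0]):
            merged[field] = value
        return merged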

This use of deltas guarantees that future reads at the same blobmaster (even by a distinct blobmaster task) will correctly reflect the indicated change as soon as a delta is written to the system. The use of deltas also has the consequence that deltas accumulate over time and slow down reads. Therefore, it is desirable to incorporate deltas into the merged base metadata as soon as possible. The process of incorporating deltas into the corresponding base value and deleting the merged deltas is called compaction.

In some embodiments, each blobmaster continually runs a maintenance cycle in the background, which examines each blob in its metadata table in turn. This maintenance cycle handles both compaction and replication. In alternative embodiments, blobmasters run a maintenance cycle on a periodic basis, such as every hour, or every 10 minutes. In some embodiments, the maintenance cycle is managed by a process other than the blobmaster. While some embodiments address compaction and replication in the same maintenance cycle, these two processes can be implemented separately.

In some embodiments, deltas are grouped together into two-dimensional “shapes.” In general, a shape comprises one or more rectangles. One of the dimensions comprises sequence identifiers, and the other dimension comprises blob IDs. Each delta applies to a unique blob (i.e., there is a unique blob ID), and has a sequence identifier, so each delta corresponds to a unique point in this two-dimensional delta space. Conversely, each point in this two-dimensional delta space corresponds to at most one delta. The deltas in this two-dimensional space are very sparse. Some embodiments provide data structures and routines to implement geometric shapes on this space, and perform ordinary computational geometry tasks on the shapes, such as intersections, unions, and set-theoretic differences.

In order to track metadata replication, some embodiments maintain an egress map and an ingress map. An egress map tracks deltas that have been transmitted to, and acknowledged by, other instances. In some embodiments, the egress map uses shapes as described above to identify the deltas that have been transmitted to and acknowledged by other instances. An ingress map tracks deltas that were transmitted from another instance to the current instance, and acknowledged by the current instance. In some embodiments, the ingress map uses shapes as described above to identify the received deltas. The ingress map at instance A from instance B should be the same as the egress map at B for A, because both represent the set of deltas transmitted from B to A and acknowledged by A.

In some embodiments, the blobmaster backs up the state of the egress map in the metadata database. Although this backup is generally reliable, the consequence of losing data from the egress map is simply that the blobmaster will retransmit some data unnecessarily. (Each delta will be inserted only once, even if the same delta is retransmitted.) When a blobmaster starts up, it reads its egress map from the metadata database, and sets up its ingress map by contacting all of its peer blobmasters at other instances to retrieve data from their egress maps.

When the maintenance cycle processes a row in the metadata table, it first determines how many of the deltas can be merged into the base data without risk of creating inconsistencies between different instances. The compaction horizon specifies the upper limit of sequence identifiers that may be compacted. The blobmaster can safely compact any deltas with sequence identifiers less than this value (i.e., merge the deltas into the merged base metadata, and discard the deltas). In some embodiments, the blobmaster can safely compact any deltas with sequence identifiers less than or equal to the compaction horizon.

Generally, it is safe to compact a specific delta if two conditions are satisfied:

-   The blobmaster knows with certainty that it will never receive another delta for this blob with a sequence identifier less than the sequence identifier of the specific delta. In general, the order of the deltas is important, so they need to be applied in sequence identifier order; and
-   If the specific delta was created at the current instance, the blobmaster must know that this delta has already been replicated to all other appropriate instances, and the replication has been successfully acknowledged. After merging, the delta will be gone, so it must be transmitted to the other instances first. Note that this condition applies only to deltas created at the current instance, because the current instance is responsible for replicating the deltas it created.

In some embodiments, the compaction horizon is computed using the egress and ingress maps. An illustrative calculation of a compaction horizon for a given blob performs the following calculation for each other instance and for each ingress and egress map. (For example, if there are fifty other instances, then the following calculation is performed 100 times.) Compute the least sequence identifier not present in the metadata row associated with the blob ID. That is, if S is the shape for the deltas in the map, look at just the sequence identifiers corresponding to the given blob ID. Find the least sequence identifier not in this set.

The same operation is performed for each other instance and each ingress and egress map. The compaction horizon computed by this method is the minimum of all the individual calculations. This is a valid compaction horizon because (a) any delta received via metadata copying in the future will have a sequence identifier greater than or equal to this value (this follows from the use of ingress maps for each instance); and (b) all deltas with sequence identifiers less than this value have already been replicated to every other instance (this follows from the use of egress maps). It is noted that any future metadata changes associated with the given blob at the current instance will have sequence identifiers greater than the computed compaction horizon because sequence identifiers are monotonically increasing.
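
A sketch of this computation, using small integers as stand-ins for sequence identifiers and modeling each map as {instance: {blob_id: set of sequencers}} (an illustrative simplification of the shape-based maps):

    def least_absent(sequencers):
        """Smallest sequence identifier not present in the given set."""
        present = set(sequencers)
        candidate = 0
        while candidate in present:
            candidate += 1
        return candidate

    def compaction_horizon(blob_id, ingress_maps, egress_maps):
        """Minimum, over every other instance and over both map types,
        of the least sequencer missing for this blob."""
        horizons = []
        for maps in (ingress_maps, egress_maps):
            for per_blob in maps.values():
                horizons.append(least_absent(per_blob.get(blob_id, ())))
        return min(horizons) if horizons else 0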

In some embodiments, the calculations above are limited to instances that have metadata for the identified blob (i.e., instances that have replicas of the blob, as well as all global instances). In other embodiments, the calculations above for ingress maps are limited to instances that have replicas of the identified blob.

The compaction process is described in more detail below with respect to FIGS. 8, 12A, 12B, 15A, and 15B. Although the above description calculated the compaction horizon for a single metadata row, some embodiments apply the same process to groups of metadata rows, which may comprise contiguous ranges of blob IDs.

In some embodiments, the maintenance cycle regularly computes a transmission plan, which is a map from shapes in delta space to sets of instances that need to receive the designated deltas. For each entry in a plan, the maintenance cycle maintains a queue. Deltas may be written into this queue, and whenever enough bytes have been written to a particular queue, the queuing system immediately initiates a transmission of metadata to the appropriate destinations, and updates its egress maps at the appropriate time (the ingress maps are updated by the destinations). In alternative embodiments, the maintenance cycle runs at periodic times, and initiates a transmission of metadata as part of each cycle.

In summary, some embodiments of the disclosed distributed storage system have a continuously running maintenance cycle that executes the following operations for each blob's metadata:

-   Compute the compaction horizon.
-   Create a metadata record in memory for this row, which includes the base metadata values for each individual data item.
-   For each delta, determine if it is in the transmission plan for any destinations. If so, add the delta to the appropriate queue(s).
-   If the delta's sequence identifier is less than the compaction horizon, apply the delta to the metadata record in memory, and mark that delta cell for deletion.
-   Once all of the deltas with sequence identifiers less than or equal to the compaction horizon have been processed, some embodiments perform special computations that happen at delta compaction time. Some embodiments use this opportunity because each delta is compacted exactly once in its lifetime at each instance; therefore, any actions that need to occur once per delta are typically scheduled to occur at the time the delta is compacted. Some embodiments perform various combinations of the following actions at compaction: (1) if any reference tombstones have expired, then the indicated reference is removed, and if this was the last reference, then the blob as a whole (all generations) is marked for deletion; (2) if any deltas have caused a user's usage of the storage system to change (e.g., the user has written new data to the system that causes a change to “accounting”), then the system records all of the relevant changes to usage; (3) if the blob metadata is no longer needed at this instance, because the blob has no replicas at this instance and the instance is not global, then the entire blob (all generations) is marked for deletion at this instance (if it is later discovered that there are uncompactable deltas, the marking for deletion is undone); (4) if a delta indicates that a particular generation ought to be removed from this instance, then that generation is marked for deletion.
-   Once all deltas have been processed, changes in usage and updates to the metadata database are recorded.

Although compaction is important for reading efficiency, and provides an opportunity to perform other once-per-delta activities, compaction does not affect the result of a read operation. The deltas are either applied to the base metadata during a read operation, or were already applied to the base metadata during compaction. Because of this, some embodiments implement compaction as a background operation.

Metadata transmission plans can be modified to improve the overall efficiency of replicating the metadata to other instances. In some embodiments, every so many rows of work, the metadata replication system draws a rectangle in delta space identifying a range of blob IDs and a range of sequence identifiers. The range of sequence identifiers is bounded by the infinite past on one side and the current sequence identifier value on the other. The rectangle is selected so that once the maintenance cycle has reached the bottom of the blob ID range, every delta in the rectangle will have been replicated to every other relevant instance. The system then compares this rectangle to each egress map entry, in turn, to see what metadata deltas still need to be transmitted to each instance. The system then merges and/or modifies the transmission sets for individual instances for optimal delivery. For example, if the set of deltas to send to instance X is nearly the same as the deltas to send to instance Y, an optimal transmission plan may send the common set of deltas to both X and Y, and the small difference to just X or Y. For each set of deltas being sent, the system designs a transmission plan that uses “tree distribution” to minimize the amount of network traffic needed. This works particularly well when the set of recipients for the same set of deltas is as large as possible. Transmission plans, and how they may be optimized, are described in more detail below with respect to FIGS. 15A and 15B.

Background Replication

Much of the efficiency provided by embodiments of the disclosed distributed storage system comes from choosing replica locations well. Having well-placed replicas minimizes the need for real-time replication and other network use. Furthermore, users will use less storage if (i) they can set policies for less-needed blobs and be confident that these policies will be obeyed, (ii) they have sufficient data integrity guarantees (e.g., making sure that there are enough backups), and (iii) they have confidence that the system will dynamically add replicas for blobs that need it. Therefore, intelligent decisions about where blobs ought to be stored reduce both network usage and disk space usage.

All non-real-time decisions about additions or deletions of blob replicas are made by a module known as the location assignment daemon, which is sometimes referred to as the LAD. The LAD is conceptually a single program that runs continually or periodically to scan the metadata for all blobs. For each blob, the LAD makes decisions about where, if anywhere, replicas ought to be added or removed. In an exemplary implementation, the LAD runs as a single (multi-tasked) program at a global instance or at an instance that is geographically close to a global index. In other embodiments, multiple smaller LADs run at various locations, and these smaller LADs send their recommendations to a central clearinghouse for collective evaluation and execution. In some embodiments, the central clearinghouse just executes the individual recommendations; in other embodiments, the central clearinghouse evaluates each of the individual recommendations in the context of the entire distributed storage system, and makes decisions on the individual recommendations based on overall resource constraints.

The reason for the central clearinghouse is that the LAD is the only subsystem that is ever allowed to remove a replica of a blob. Without centralized control, there would be very tricky synchronization issues. For example, if there were two LADs and two replicas of a certain blob, each LAD could independently decide that it is safe to remove one, and they could remove different ones, eliminating all replicas of the blob.

LAD decisions are based on policies for each individual blob. In some embodiments, blob policies are specified by a set of predefined attributes. Other embodiments provide a blob policy expression language, which allows greater flexibility in defining blob policies. Still other embodiments provide a hybrid approach, including both a predefined set of attributes and an expression language for more complex policy needs.

Embodiments of the LAD have multiple possible implementations, but the implementations have a basic structure in common. The LAD processors examine each blob in some specified sequence. Some embodiments process the blobs in a random or pseudo-random order; some embodiments process the blobs in alphabetical order by the names assigned to the blobs; some embodiments perform a quick first-pass prioritization, then process the blobs in that priority order. For each blob, LAD implementations look at the current set of replicas of the blob and the replication policy for the blob.

For each chunk store, a blob may be in one of four states: absent, present, present and acting as the source of a copy, or the destination of an active copy operation.

A blob may have multiple references, and each reference may have a policy. The policies must be mergeable in a meaningful way: basically, the “policy for a blob” is the union of the policies for each of the references. For example, if one policy says “two replicas on-disk, one of which must be in the western United States” and another says “three replicas on-disk, one of which must be in Europe,” the merged policy would be “three replicas on-disk, one of which must be in the western United States, and another of which must be in Europe.”
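
Using the two-field policy model suggested by this example (a replica count plus location constraints, a simplification of real policies), the merge can be sketched as:

    def merge_policies(policies):
        """Union of per-reference policies: the maximum replica count
        and the union of all location constraints."""
        replicas = 0
        locations = set()
        for policy in policies:
            replicas = max(replicas, policy["disk_replicas"])
            locations |= set(policy["required_locations"])
        return {"disk_replicas": replicas,
                "required_locations": sorted(locations)}

    # The example from the text:
    merged = merge_policies([
        {"disk_replicas": 2, "required_locations": ["western United States"]},
        {"disk_replicas": 3, "required_locations": ["Europe"]},
    ])
    # -> {'disk_replicas': 3,
    #     'required_locations': ['Europe', 'western United States']}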

The LAD compares the current state of the blob to the policy, and decides whether it should add or remove any replicas. Generally, the basis for such a decision is to compute the cost and the benefit of any such operation. Benefits include improved compliance with a policy, positioning of a blob closer to where it is expected to be accessed soon, or reduced storage cost from eliminating a no-longer-needed replica. Costs include storage costs and network transit costs. The expected profit is the difference of these two, and if the profit is positive, it establishes the priority for performing this particular operation.

The suggested operations are then inserted into a priority queue and executed, either as they come or in batches. When there is a request to add a new replica, the profit computed by the LAD is used as the priority for the ReplicateBlob( ) operation.

Although the basic structure of the LAD is common across multiple implementations, there are some noteworthy differences. As noted above, different LAD implementations may process the blobs in different orders. Another difference is the set of algorithms used to assign costs and benefits to each proposed operation. Some LAD embodiments use a simple rule-based algorithm, such as “if the number of replicas is less than that specified by the policy, then adding new replicas that would match the policy is worth a fixed benefit of X.” Other LAD embodiments implement a continuous auction of storage resources, where costs are determined by an “open market” of storage and network capacity, and individual blobs act as “bidders.” Another “cost” that is added in some embodiments is a transaction overhead, which prevents moving replicas from one instance to another because of a small benefit. Without the consideration of overhead costs, there could be “oscillation” of a replica back and forth between two instances. The auction methodology generally provides a better allocation of storage and network resources because it considers the overall advantages for groups of blobs rather than doing the analysis for each blob in isolation. Finally, different LAD implementations execute operations in different ways. For example, some implementations execute operations singly, whereas other embodiments execute operations in batches.

Some embodiments provide a LAD simulation system. A LAD simulation system works by running the real LAD against an artificial world. The inputs are a statistical summary of the current state of real blobs (rather than the complete table of blob states, which is very large), and a configuration indicating a sequence of events that may happen at various times in the future. For example, “at time X, we will add 50 petabytes of capacity to the BigTable chunk store at the instance in Chicago” or “at time X, the entire instance in southern India suddenly fails.” The simulator runs the LAD against this simulated universe, and applies the LAD operations back to the universe, producing various graphs and records of what would happen over time.

A LAD simulation system provides many advantages. One advantage is that it allows testing of new algorithms for the LAD: developers can see the consequences of new algorithms without having to actually find out in the real world (which would be both dangerous and expensive). Another advantage is that it allows for capacity planning: by feeding projections for changes in system usage and underlying capacity availability into the system, developers can see what the distributed storage system will need over time, and thus plan capital equipment acquisitions. Yet another advantage provided by LAD simulation is that it facilitates disaster readiness: by simulating disaster events of various sorts, developers can verify that the system will respond appropriately in those cases. If not, developers can modify the LAD algorithms so that the distributed storage system does respond well when real disasters occur. An additional advantage of a LAD simulation system is to provide a near-term view of the future. By continually running the LAD simulator against the “plan of record,” using statistical data periodically derived from the actual state of the world, developers can predict how the distributed storage system will respond over a period of weeks or months, and thus be aware of future events before they happen.

Some embodiments implement a micro-LAD that plays an important role for newly created blobs. When a blob has finished writing (i.e., when it is marked as finalized), the LAD algorithm is immediately run at the blobmaster where the blob was created. This execution of the LAD algorithm is allowed to create new replicas, but is not allowed to remove any replicas. In some embodiments, the micro-LAD executes only for the newly created blob; in other embodiments, the micro-LAD executes for all blobs stored at the instance where the new blob was created. In general, one or more additional replicas of the blob will be needed to reach the policy goal, so creating additional replicas immediately is important. Until the new replicas are made, the blob is vulnerable to becoming unavailable if the instance becomes unavailable, or even lost if the instance is suddenly destroyed. This immediate micro-LAD run bypasses the usual wait time for a whole cycle of the LAD to complete.

Tape Backup

Embodiments of the disclosed distributed storage system implement a novel approach to tape backup. Unlike most databases, which use a separate scan-and-backup system, embodiments of the present invention treat tape as simply another storage type. In some embodiments, tape provides multiple storage types, such as tapes that are kept in the tape library versus tapes that are carted off to a vault somewhere. In some embodiments, the difference between tape stores and other data stores is that, because tape is so slow, one is not allowed to directly read from or write to a tape store. In these embodiments, one may only replicate to and from a tape store, which is typically implemented as a background operation. Embodiments of the present invention include a tape manager module that manages a large tape buffer. The tape buffer acts as a staging area for data going to or from tape. In some embodiments, implementations of the tape manager allow a client to read or write to tape. Because tape operations are very slow, client read and write operations will typically be directed to other data stores, even when tape is directly available. Therefore, either by design or by practical considerations, reading and writing to tape generally does not happen in a real-time way.

Conceptually, backups are therefore driven by blob replication policies. For example, a user or user application may specify the policy “2 copies on-disk, and one copy on-tape, in three different cities.” This is a typical policy a user might choose. Multiple copies on disk provide both increased data integrity, in case a single copy fails, and increased availability, in case one replica is at an instance that is temporarily unavailable. Having the replicas at distinct locations can also provide faster access to a greater number of users near each replica. Tape copies improve data integrity but not availability. On the other hand, tape copies are considerably cheaper. The multiple-city requirement in this example policy provides protection against events such as blackouts, which can disrupt multiple instances at the same time. LAD replication will write a copy to tape in some appropriate location.

Blob policies effectively address what will happen at a distributed storage system when a catastrophic event occurs (such as the failure of an instance). An operator indicates to the system that all chunk stores at this instance are now invalid. In some embodiments, an operator does this by updating a central configuration file. When the LAD next examines a blob with replicas at the instance marked as invalid, the LAD will discover that the blob is now under-policy: one of its replicas has gone away. The LAD therefore triggers a new replication to restore equilibrium. The cost of reading from tape is generally higher than the cost of reading from disk, because it often involves physically picking up tapes and moving them to a tape drive device. Therefore, the LAD will generally choose to create a new replica from a surviving on-disk replica. However, there may be no such replica. For example, the policy may specify one replica on-disk and one replica on-tape, or the blob may not have been fully replicated yet. In these cases, the LAD will initiate replication from tape.

Another kind of catastrophic event involves overwriting or deleting a blob due to operator error or malice. In this case, an operator can recover the old version by manually requesting that an earlier generation of the same blob be replicated to various locations, and that the new (bad) generations be deleted. In some embodiments, this is implemented by calls to ReplicateBlob( ).

In order to handle these sorts of catastrophic events, some embodiments implement tape as a type of chunk store. At a very high level, this works by maintaining a staging area. In some embodiments, the staging area is a set of files on an ordinary distributed file system. In other embodiments, the staging area is a large memory buffer, which may comprise either volatile or non-volatile storage. Blobs going to or from tape are written to designated locations in this staging area. A tape master module monitors the staging area, and assigns blobs to batches, which are then committed to the underlying tape system using appropriate commands. Tape storage is described in more detail in co-pending U.S. Provisional Patent Application Ser. No. 61/302,909, “Method and System for Providing Efficient Access to a Tape Storage System,” filed Feb. 9, 2010, which is incorporated herein by reference in its entirety.

Accounting

It is important to keep track of how much storage and network traffic each user uses, for a number of reasons: billing, capacity planning, usage quotas, etc. A usage quota specifies a maximum allowed usage of a resource for a user. This is important so that an ordinary user does not use up a disproportionately large percentage of the disk space or network bandwidth, which can adversely affect other users, including other users with higher-priority tasks. In some embodiments, quotas are stored in a set of quota servers distinct from the blob and metadata storage at an instance. Quota servers essentially store a set of [tag, usage] pairs, which allow easy lookup, and produce logs for auditing purposes. Some embodiments use the following keys for accounting: username, chunk store name, and storage mode. In some embodiments, storage usage is specified as a number of bytes.

Some embodiments include four or more storage modes, including:

-   TOTAL: All bytes owned by a given user in a given chunk store.
-   HYLIC: Bytes in chunks that have been written via the bitpusher, but not yet attached to any blob via a metadata update.
-   LIVE: Bytes in chunks that belong to a blob for which the user owns at least one reference (and the reference has not yet been marked as deleted).
-   ZOMBIE: Bytes in chunks that belong to a blob for which all of the user's references have been deleted, although the blob itself has not yet vanished.

As these exemplary storage modes illustrate, the storage modes are not necessarily mutually exclusive. For example, LIVE bytes and ZOMBIE bytes are mutually exclusive, but both of these are included in the TOTAL bytes.

These byte counters are incremented by bitpushers when chunks are created or destroyed, and by the blobmaster at delta compaction time. However, managing these transitions (such as bytes going from HYLIC to LIVE) is surprisingly complicated. To achieve accurate accounting of bytes, some embodiments have the blobmaster maintain a state machine, which tracks blob states at each chunk store for each user who owns a reference to that blob. That is, there is a state assigned to each (blob, user, chunk store) triple. The states essentially track the stages of a blob, from early creation to eventual deletion. Within this lifespan, some embodiments identify four states:

-   HYLIC: A blob is “hylic” for a user if the chunks have been written to a bitpusher under the ownership of that particular user, but the chunks have not yet been attached to any blob. In this inchoate state, the blob is not accessible to anyone.
-   LIVE: A blob is “live” for a user if that user has at least one reference that does not have a tombstone on it.
-   ZOMBIE: A blob is a “zombie” for a user if that user has at least one reference, but all of the user's references have tombstones.
-   DEAD: A blob is “dead” for a user if the user owns no references to that blob, and/or the blob does not exist in the given chunk store. This is the default state.

In summary, for each (blob, user, chunk store) triple, some embodiments track both the state of the blob and the number of bytes that the blob uses in the chunk store. Note that two replicas of the same blob may use different numbers of bytes. For example, some embodiments count byte usage according to the block sizes used in the chunk stores; a file system chunk store may implement 4K blocks, so each blob would use an integer number of these blocks.

Every event in the life of a blob can be considered as moving the blob between the four states identified above, and transitions between these four states correspond to changes in the usage for the four storage modes. The first two storage usage rules below depend on the original state of the blob, and the last two depend on the new state of the blob. For each transition, exactly two of the following transition rules will apply:

-   If a number of bytes are moving (for a particular user and chunk store) from the DEAD state to any other state, the TOTAL usage is incremented by that number.
-   If a number of bytes are moving from any state other than DEAD, the usage for that storage mode is decremented by that number of bytes. For example, in a transition from the HYLIC state to the LIVE state, the HYLIC usage is decremented.
-   If a number of bytes are moving (for a particular user and chunk store) to the DEAD state, the TOTAL usage is decremented by that number.
-   If a number of bytes are moving to any state other than DEAD, the usage for that storage mode is incremented by that number of bytes. For example, in a transition from the HYLIC state to the LIVE state, the LIVE usage is incremented.
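
These four rules translate directly into code. The following sketch applies them to a usage table keyed by (user, chunk store, mode); the data layout is illustrative:

    from collections import defaultdict

    usage = defaultdict(int)  # (user, chunk_store, mode) -> bytes

    def transition(user, store, old_state, new_state, nbytes):
        """Apply the two applicable rules for one state change."""
        if old_state == "DEAD":
            usage[(user, store, "TOTAL")] += nbytes  # bytes enter the store
        else:
            usage[(user, store, old_state)] -= nbytes  # leave the old mode
        if new_state == "DEAD":
            usage[(user, store, "TOTAL")] -= nbytes  # bytes leave the store
        else:
            usage[(user, store, new_state)] += nbytes  # enter the new mode

    # Jim's example from the next paragraph: a 100-byte chunk is written
    # (DEAD -> HYLIC), then attached to a blob (HYLIC -> LIVE).
    transition("jim", "chunkstore", "DEAD", "HYLIC", 100)
    transition("jim", "chunkstore", "HYLIC", "LIVE", 100)
    # usage now shows TOTAL = 100, HYLIC = 0, LIVE = 100 for Jim.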

The following sample sequence of events illustrates the accounting process. User Jim writes a 100-byte chunk to the bitpusher. The bitpusher increments the usage for (Jim, chunkstore, HYLIC) and (Jim, chunkstore, TOTAL) by 100. Jim then calls FlushMetadata( ) on his blob writer object, causing those chunks to be added to a blob. The blobmaster records that the size of this blob in the particular chunk store has increased by 100, and notes that 100 bytes have moved from (Jim, chunkstore, HYLIC) to (Jim, chunkstore, LIVE). Under unusual circumstances, the 100 bytes would instead be moved to (Jim, chunkstore, ZOMBIE) if Jim's reference(s) to this blob all have tombstones. If any other users have references to this blob, then for each of them, 100 bytes move from (username, chunkstore, DEAD) to (username, chunkstore, LIVE). Just like Jim, these could be in the ZOMBIE storage mode depending on the state of each user's references.

Continuing the sample sequence of events, assume someone adds a reference to a blob. This results in incrementing the count of the user's total references to the blob and the user's count of live references to the blob. The user's usage transitions from (username, chunkstore, state) to (username, chunkstore, new state). The state may change between LIVE and ZOMBIE. Similar things happen when a reference is removed, or when a tombstone expires. If a replica is removed from an instance, then for all users who have references to that blob, the state changes from whatever it was to DEAD.

The above explanation demonstrates how embodiments of the present invention naturally express every event in the life of a blob in terms of the four primitive modes. The values for those four modes are saved in a log and subsequently used to compute usage over time in each chunk store for each user. This is used to produce billing information. In some embodiments, billing information depends only on TOTAL bytes. Monitoring of hylic, live, and zombie bytes is important so that users can see where their bill is coming from; an anomalously high fraction of hylic or zombie bytes could indicate a problem. In some embodiments, different billing rates apply to different storage modes. For example, the billing rate for LIVE storage may be higher than the rate for HYLIC or ZOMBIE storage.

Some embodiments of the disclosed distributed storage system track other information in addition to blob storage usage. For example, some embodiments track counts of read and write operations to each chunk store, and each user's usage of each network link. These items translate directly to billing in a natural way, but generally do not involve anything as complicated as the state transitions outlined above.

Logging

Some embodiments log events in the life of a blob for debugging and auditing purposes. In some embodiments, the log is structured as a database, which may be implemented in a BigTable or a relational database. In a BigTable implementation, the key for each row is a blob ID, and the value is simply the sequence of every metadata delta that has been applied to that blob. In some embodiments, certain information is stripped out to limit the size, and to prevent blob contents from being inadvertently revealed by the log data.

A “life of a blob” server is an exemplary front end for this event log. The server may be queried for any particular blob ID by an authorized user, who can then see the full history of all mutations. This information can be used for debugging purposes. Additionally, metadata deltas may have human-readable annotations indicating the author and purpose. Certain metadata changes require such annotations, such as the setting of the “administrative bits” in a blob. These are flags that may be used for legal purposes. Two exemplary administrative bits are:

-   “blocked”: When this bit is set, the system does not return the contents of this blob to any user, except for designated superusers. However, the blob itself is not actually deleted. One intended purpose of this flag is to respond to legally imposed takedown orders. Even when taken down, there may be reasons not to actually discard the copy.
-   “preserved”: When this bit is set, the system will not delete the specified generation unless the entire blob is deleted. If this bit is set in combination with the caller adding a new reference, the blob contents will always be preserved by the system until the reference is released or the flag is cleared. One intended purpose of this flag is to respond to legal preservation orders.

Coprocessors

Because network links are expensive, sometimes it is useful for a user to be able to execute a function to transform a replica of a blob close to where that blob is stored, and transmit the transformed copy to the user. For example, a user may store digital images in embodiments of the distributed storage system, and want to generate small thumbnail images. Rather than shipping a large image across the planet and then computing the thumbnail, it is more efficient to first compute a thumbnail and ship just the thumbnail to the destination. Some embodiments of the disclosed distributed storage system include coprocessor functionality to implement these types of transformations.

In some embodiments, coprocessors are programs initiated by a user of the system that execute within the distributed storage system. Some embodiments expose a network RPC API (Remote Procedure Call in an Application Programmer Interface), which may be accessed by a load balancing system under some particular service name. Some embodiments extend the “read” function in the interface to take as arguments the load-balancer name of a coprocessor, and the name of the RPC to be called. Such a “read” command requests that the given blob be passed through the given function call, with the transformed blob returned to the user.
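A sketch of what such an extended read interface could look like. The coprocessor_service and rpc_name parameters mirror the two extra arguments described above; everything else (the stub chunk fetch, the toy thumbnailer) is assumed for illustration.

```python
# Toy stand-ins for the real read path and RPC layer; names are assumed.
def fetch_chunks(blob_id):
    return b"<full image bytes>"

def call_rpc(service, rpc_name, payload, **args):
    # A real system would resolve `service` through the load balancer.
    if (service, rpc_name) == ("thumbnailer", "Thumbnail"):
        return payload[: args.get("max_dim", 128)]   # pretend thumbnail
    raise ValueError("unknown coprocessor")

def read(blob_id, coprocessor_service=None, rpc_name=None, **rpc_args):
    """Extended read: optionally pass the blob through a coprocessor
    running near the data, returning only the (smaller) result."""
    contents = fetch_chunks(blob_id)
    if coprocessor_service is None:
        return contents                    # plain read
    return call_rpc(coprocessor_service, rpc_name, contents, **rpc_args)

print(read("photo-42"))                                 # whole blob
print(read("photo-42", "thumbnailer", "Thumbnail", max_dim=8))
```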

In some embodiments, the client receives the metadata from the blobmaster as usual, and when the client requests the actual contents of the blob from the bitpusher, the bitpusher initiates the transformation. That is, the bitpusher reads the contents of the blob, calls a nearby coprocessing server (via a load-balancer call) to perform the indicated RPC, and returns that result to the user rather than the blob contents. In the case of inline blobs, the blobmaster does this instead of the bitpusher.

In other embodiments, the “derived blobs” created from the coprocessor call are cached and ultimately saved as part of the original blob. For example, if the blobmaster received a request for the output of passing blob X through the thumbnailer, it could look at the replicas and say, “there is an unthumbnailed copy close by, or a thumbnail copy a bit farther away,” and decide whether it would be more efficient to re-run the coprocessor or to fetch the remote data. This is similar to decisions about serving data from a remote instance versus performing a real-time replication to a closer instance.

“Representations” of blobs support this concept of creating derived blobs from an original blob. When a coprocessor call is made, if this call has been marked as cacheable, the bitpusher will write out the results of the coprocessor call as chunks to one of its own local data stores, and then inform the blobmaster that it has created a new representation of the blob. In some embodiments, the representation ID of the derived blob is the name of the coprocessor call and the set of arguments that were passed to the coprocessor call. This representation is considered to be a part of the generation from which the derived blob was created. Representations created as derived blobs are replicated and propagated in the usual way.
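For instance, the representation ID might be formed by serializing the call name and arguments canonically, so that identical requests map to the same cached derivative. The serialization choice below is an assumption; the text says only that the ID comprises the call name and its arguments.

```python
import json

def representation_id(rpc_name, rpc_args):
    """Name a derived blob by the coprocessor call that produced it;
    canonical serialization makes identical calls collide on purpose."""
    return f"{rpc_name}:{json.dumps(rpc_args, sort_keys=True)}"

cache = {}                                # toy per-generation storage
rid = representation_id("Thumbnail", {"max_dim": 128})
cache[rid] = b"<thumbnail chunks>"
# A later identical request finds the cached representation:
assert representation_id("Thumbnail", {"max_dim": 128}) in cache
```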

Saving copies of transformed blobs provides for on-demand performance of potentially expensive operations. In the thumbnailing example, it means that one does not need to precompute thumbnails for every image in order to have them quickly available. Once an image is thumbnailed, the thumbnail persists in the blob store, and future reads can access it. This is especially important for operations that are expensive in both computation and storage, such as conversion of file formats.

The Stable Clock System

In some embodiments, the timestamps used to construct sequence identifiers just read the time from the computer's clock. However, computer clocks are imperfect for several reasons. First, many computer clocks do not track time with sufficient accuracy. Second, computer clocks sometimes jump forward or backward for unknown reasons. Implementations of the disclosed distributed storage systems require a clock that is both accurate and guaranteed to be monotonically increasing, so some embodiments implement a stable clock system.

In a stable clock system, the timestamps need to be monotonically increasing. Specifically, within the lifetime of a single UNIX process, successive sequence identifiers need to be increasing. In some embodiments, this is implemented by running a simple monotonic clock on top of an underlying clock. This guarantees strictly increasing sequence identifiers within a single process (e.g., a blobmaster task), but does not guarantee that sequence identifiers issued by different tasks will appear in the right order. In particular, if two successive operations are routed to different blobmaster tasks (e.g., due to load-balancing), they may be issued sequence identifiers that are out of order because the internal time clocks are different. This is contrary to user expectations, and can lead to unexplainable results.
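A minimal sketch of such a monotonic clock layered on an underlying clock; the microsecond unit and the MonotonicClock name are assumptions.

```python
import time

class MonotonicClock:
    """Strictly increasing readings layered on an underlying clock,
    even if the underlying clock stalls or steps backward."""
    def __init__(self, underlying=lambda: int(time.time() * 1e6)):
        self._underlying = underlying     # microseconds since the epoch
        self._last = 0

    def now(self):
        t = self._underlying()
        if t <= self._last:
            t = self._last + 1            # force strict monotonicity
        self._last = t
        return t

clk = MonotonicClock()
a, b = clk.now(), clk.now()
assert b > a   # holds even if the system clock did not advance
```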

Some embodiments avoid this problem with functionality in the client library. Whenever the client library receives a response from a blobmaster about any operation that issued a new sequence identifier, the client library stores that sequence identifier in memory. When the client library sends future requests, it attaches that sequence identifier to the call, so that any new sequence identifiers issued are greater than that one. This solves the ordering problem, but introduces another one. Since each blobmaster's clock must be monotonic, a blobmaster may have to manually advance its own clock by some amount in order to generate a sequence identifier that is greater than the one passed from the client library. If a client were to send a malformed request, it could corrupt the entire state of the blobmaster, pushing its internal clock into the distant future.

Some embodiments avoid this new problem by placing a limit on how far forward the clock can be manually adjusted. If the timestamp portion of the sequence identifier passed from the client is too far in the future (e.g., a gap of more than a minute), the blobmaster assumes that the clock value is bogus, and returns an error. However, this creates yet another problem: there is no obvious remediation that a client can perform in response to these errors. At best, the system can make sure that these errors are rare and meaningful.
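Putting the last two paragraphs together, a blobmaster might issue timestamps as sketched below: catch up to the client's hint, but reject hints more than a fixed skew ahead. The one-minute limit follows the example in the text; the function and exception names are hypothetical.

```python
import time

MAX_SKEW_US = 60_000_000   # one minute, matching the example in the text

class BogusClockError(Exception):
    pass

def issue_timestamp(now_us, client_hint_us=0):
    """Pick a sequence timestamp that orders after the client's hint,
    rejecting hints more than MAX_SKEW_US ahead of the local clock."""
    if client_hint_us > now_us + MAX_SKEW_US:
        raise BogusClockError("client timestamp too far in the future")
    return max(now_us, client_hint_us + 1)

now = int(time.time() * 1e6)
print(issue_timestamp(now, client_hint_us=now + 5))    # small catch-up: ok
try:
    issue_timestamp(now, client_hint_us=now + 10 * MAX_SKEW_US)
except BogusClockError as e:
    print("rejected:", e)
```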

There are two things that could cause this type of irremediable error. One source of the error is a problem on the client side that sent a bogus sequence identifier value. This could be due to a bug in the client library, or memory corruption in the client application that resulted in overwriting the real value. There is no way to avert these problems with certainty, but eliminating all other sources of the issue would help to identify these potential problems as the source. The other class of problem is that another blobmaster issued a bogus sequence identifier, far in the future, and the client then (correctly) propagated the bad value everywhere, so that all blobmasters had the incorrect future time.

This problem can be averted by ensuring that no blobmaster's clock suddenly jumps forward. Unfortunately, this can easily happen with a machine's system clock for a variety of reasons, including NTP (network time protocol) updates or sporadic hardware failures. Therefore, to avoid these problems, some embodiments do not use the machine's system clock as the underlying clock for sequence generation. Instead, these embodiments use a stable clock system.

Some embodiments of a stable clock system comprise three layers, as illustrated in FIG. 26. At the bottom are the quorum clock servers 2606-1 to 2606-5, which are servers that simply report their own machine system time in response to a query. The middle layer is the “reliable clock” 2604, which is a software library running on the blobmaster that determines the current time by querying a number of clock servers, and verifying that they agree about what time it is to within some specified precision. In some embodiments, the precision is specified as a small number of milliseconds. In other embodiments, the precision has a predefined value in the range of 100-1000 microseconds. If the clock servers 2606 do not agree, the reliable clock server 2604 re-polls the clocks. If there still isn't a reasonable quorum after a few tries, the reliable clock 2604 alerts a human that something has gone seriously wrong. The reliable clock server 2604 is not prone to skewing, but it responds to requests relatively slowly, because it calls up each of the clock servers to determine a quorum, rather than reading a single hardware register on a local computer. Therefore, some embodiments include a third layer on top of this, which is sometimes referred to as the “cached clock” server 2602. The cached clock server 2602 periodically queries the reliable clock server 2604, and uses the time from the reliable clock server 2604 to calibrate its own machine system clock. That is, whenever the cached clock server 2602 gets a result from the reliable clock server 2604, the cached clock server 2602 redefines “now” to be the value it received. At any time in the future, the time reported by the cached clock server 2602 will be that reliable clock value, plus the elapsed time as measured by the machine system clock. If the elapsed time as measured by the system clock ever exceeds the time interval between reliable clock retests, the cached clock server 2602 instead rechecks the reliable clock server 2604. In this way, if the machine clock on the cached clock server 2602 does suddenly skew forward, it will trigger a recheck of the reliable clock server 2604, rather than allowing a bogus timestamp to be returned to the caller.
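A condensed sketch of the reliable-clock and cached-clock layers under stated assumptions (averaging agreeing readings, a one-second recheck interval, simulated clock servers); the real system's policies are not specified at this level of detail.

```python
import random, time

def quorum_time(poll_fns, precision_us=500, max_tries=3):
    """Reliable clock: accept a reading only when all polled clock
    servers agree to within `precision_us`; otherwise re-poll."""
    for _ in range(max_tries):
        readings = [poll() for poll in poll_fns]
        if max(readings) - min(readings) <= precision_us:
            return sum(readings) // len(readings)
    raise RuntimeError("clock servers disagree; alert a human")

class CachedClock:
    """Cached clock: calibrate the fast local clock against the slow
    reliable clock, and recheck if too much local time elapses."""
    def __init__(self, reliable_now, recheck_interval_us=1_000_000):
        self._reliable_now = reliable_now
        self._interval = recheck_interval_us
        self._recalibrate()

    def _recalibrate(self):
        self._base = self._reliable_now()             # quorum "now"
        self._local_at_base = time.monotonic_ns() // 1000

    def now(self):
        elapsed = time.monotonic_ns() // 1000 - self._local_at_base
        if elapsed > self._interval:    # possible skew: recheck quorum
            self._recalibrate()
            elapsed = 0
        return self._base + elapsed

# Five simulated quorum clock servers, each a little noisy.
servers = [lambda: int(time.time() * 1e6) + random.randint(-100, 100)
           for _ in range(5)]
clock = CachedClock(lambda: quorum_time(servers))
print(clock.now())
```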

Detailed Description of Some Embodiments

The present specification describes a distributed storage system. In some embodiments, as illustrated in FIG. 1A, the distributed storage system is implemented on a global or planet scale. In these embodiments, there are a plurality of instances 102-1, 102-2, . . . 102-N at various locations on the Earth 100, connected by network communication links 104-1, 104-2, . . . 104-M. In some embodiments, an instance (such as instance 102-1) corresponds to a data center. In other embodiments, multiple instances are physically located at the same data center. Although the conceptual diagram of FIG. 1A shows a limited number of network communication links 104-1, etc., typical embodiments would have many more network communication links. In some embodiments, there are two or more network communication links between the same pair of instances, as illustrated by links 104-5 and 104-6 between instance 2 (102-2) and instance 6 (102-6). In some embodiments, the network communication links are composed of fiber optic cable. In some embodiments, some of the network communication links use wireless technology, such as microwaves. In some embodiments, each network communication link has a specified bandwidth and/or a specified cost for the use of that bandwidth. In some embodiments, statistics are maintained about the transfer of data across one or more of the network communication links, including throughput rate, times of availability, reliability of the links, etc. Each instance typically has data stores and associated databases (as shown in FIGS. 2 and 3), and utilizes a farm of server computers (“instance servers,” see FIG. 4) to perform all of the tasks. In some embodiments, there are one or more instances that have limited functionality, such as acting as a repeater for data transmissions between other instances. Limited functionality instances may or may not have any of the data stores depicted in FIGS. 3 and 4.

FIG. 1B illustrates data and programs at an instance 102-i that store and replicate data between instances. The underlying data items 122-1, 122-2, etc. are stored and managed by one or more database units 120. Each instance 102-i has a replication unit 124 that replicates data to and from other instances. The replication unit 124 also manages one or more egress maps 134 that track data sent to and acknowledged by other instances. Similarly, the replication unit 124 manages one or more ingress maps, which track data received at the instance from other instances. Egress maps and ingress maps are described in more detail below with respect to FIGS. 14A-14D, 15A, and 17.

Each instance 102-i has one or more clock servers 126 that provide accurate time. In some embodiments, the clock servers 126 provide time as the number of microseconds past a well-defined point in the past. In some embodiments, the clock servers provide time readings that are guaranteed to be monotonically increasing. In some embodiments, each instance 102-i stores an instance identifier 128 that uniquely identifies itself within the distributed storage system. The instance identifier may be saved in any convenient format, such as a 32-bit integer, a 64-bit integer, or a fixed-length character string. In some embodiments, the instance identifier is incorporated (directly or indirectly) into other unique identifiers generated at the instance. In some embodiments, an instance 102-i stores a row identifier seed 130, which is used when new data items 122 are inserted into the database. A row identifier is used to uniquely identify each data item 122. In some embodiments, the row identifier seed is used to create a row identifier, and simultaneously incremented, so that the next row identifier will be greater. In other embodiments, unique row identifiers are created from a timestamp provided by the clock servers 126, without the use of a row identifier seed. In some embodiments, a tie breaker value 132 is used when generating row identifiers or unique identifiers for data changes (described below with respect to FIGS. 6-9). In some embodiments, a tie breaker 132 is stored permanently in non-volatile memory (such as a magnetic or optical disk).
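For example, a seed-based row identifier generator might look like the following sketch; the locking and the class name are assumptions, and the timestamp-based alternative mentioned above is omitted.

```python
import itertools, threading

class RowIdGenerator:
    """Seed-based row identifiers: hand out the seed, then increment it
    so the next identifier is always greater."""
    def __init__(self, seed=0):
        self._next = itertools.count(seed)
        self._lock = threading.Lock()    # keep concurrent inserts safe

    def new_row_id(self):
        with self._lock:
            return next(self._next)

gen = RowIdGenerator(seed=1_000)
ids = [gen.new_row_id() for _ in range(3)]
assert ids == sorted(ids) and len(set(ids)) == 3   # unique, increasing
```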

The elements described in FIG. 1B are incorporated in embodiments of the distributed storage system 200 illustrated in FIGS. 2 and 3. In some embodiments, the functionality described in FIG. 1B is included in a blobmaster 204 and metadata store 206. In these embodiments, the primary data storage (i.e., blobs) is in the data stores 212, 214, 216, 218, and 220, and managed by bitpushers 210. The metadata for the blobs is in the metadata store 206, and managed by the blobmaster 204. The metadata corresponds to the functionality identified in FIG. 1B. Although the metadata for storage of blobs provides an exemplary embodiment of the present invention, one of ordinary skill in the art would recognize that the present invention is not limited to this embodiment.

In some embodiments of the disclosed distributed storage system 200, the distributed storage system is used by one or more user applications 308, which are provided by application servers, such as 150-1, 150-2, 150-3, 150-4, and 150-5 illustrated in FIGS. 1C-1G. Exemplary user applications that use embodiments of the disclosed distributed storage system include Gmail, YouTube, Orkut, Google Docs, and Picasa. Some embodiments of the disclosed distributed storage system simultaneously provide storage for multiple distinct user applications, and impose no limit on the number of distinct user applications that can use the distributed storage system. For example, a single implementation of the disclosed distributed storage system may provide storage services for all of the exemplary user applications listed above. In some embodiments, a user application 308 runs in a web browser 306 on a user computer system 304. A user 302 interacts with a user application 308 according to the interface provided by the user application. Each user application 308 uses a client library 310 to store and retrieve data from the distributed storage system 200.

FIG. 1C illustrates an embodiment in which a user application is provided by one or more application servers 150-1. In some embodiments, the web browser 306 downloads user application 308 over a network 328 from the application servers 150-1. In addition to communication between the application server 150-1 and the user system 304, the application server(s) 150-1 communicate over network 328 with the distributed storage system 200. In particular, the application servers may establish storage policies 326 that are applicable to all data stored by the supplied user application. For example, administrators of the Gmail application servers may establish storage policies 326 that are applicable to millions of users of Gmail.

In some embodiments, communication between the client library 310 and the distributed storage system utilizes a load balancer 314, which can distribute user requests to various instances within the distributed storage system based on various conditions, such as network traffic and usage levels at each instance. In the embodiment illustrated in FIG. 1C, the load balancer 314 is not an integrated component of the distributed storage system 200. The load balancer 314 communicates with both the client library 310 and the distributed storage system 200 over one or more networks 328. The network 328 may include the Internet, one or more local area networks (LANs), one or more wide area networks (WANs), one or more wireless networks (WiFi networks), or various combinations of these.

FIG. 1D illustrates an embodiment that is similar to FIG. 1C, except that the load balancing system 314 just returns information to the client library 310 to specify which instance 102 within the distributed storage system 200 should be contacted. The client library 310 then contacts the appropriate instance 102 directly.

FIG. 1E illustrates an embodiment that is similar to FIG. 1C, except that the load balancing system 314 is an integrated part of the distributed storage system 200. In some embodiments, load balancers 314 are included at some or all of the instances within the distributed storage system 200. Even in these embodiments, a load balancer 314 may direct the communication to a different instance.

FIG. 1F illustrates an embodiment that is similar to FIG. 1C, except that the load balancing service 314 is included in the application servers 150-4. This embodiment is more commonly used when the distributed storage system 200 is being used by a single user application provided by the application servers 150-4. In this case, the load balancer 314 has a complete picture of the load because the application servers 150-4 receive all of the traffic directed to the distributed storage system.

FIG. 1G illustrates a variation of FIG. 1F, in which the client library 310 is maintained at the application servers 150-5 rather than integrated within the running user application 308.

The distributed storage system 200 shown in FIGS. 2 and 3 includes certain global applications and configuration information 202, as well as a plurality of instances 102-1, . . . 102-N. In some embodiments, the global configuration information includes a list of instances and information about each instance. In some embodiments, the information for each instance includes: the set of storage nodes (data stores) at the instance; the state information, which in some embodiments includes whether the metadata at the instance is global or local; and network addresses to reach the blobmaster 204 and bitpusher 210 at the instance. In some embodiments, the global configuration information 202 resides at a single physical location, and that information is retrieved as needed. In other embodiments, copies of the global configuration information 202 are stored at multiple locations. In some embodiments, copies of the global configuration information 202 are stored at some or all of the instances. In some embodiments, the global configuration information can only be modified at a single location, and changes are transferred to other locations by one-way replication. In some embodiments, there are certain global applications, such as the location assignment daemon 346 (see FIG. 3), that can only run at one location at any given time. In some embodiments, the global applications run at a selected instance, but in other embodiments, one or more of the global applications runs on a set of servers distinct from the instances. In some embodiments, the location where a global application is running is specified as part of the global configuration information 202, and is subject to change over time.

FIGS. 2 and 3 illustrate an exemplary set of programs, processes, and data that run or exist at each instance, as well as a user system that may access the distributed storage system 200 and some global applications and configuration. In some embodiments, a user 302 interacts with a user system 304, which may be a computer or other device that can run a web browser 306. A user application 308 runs in the web browser, and uses functionality provided by database client 310 to access data stored in the distributed storage system 200 using network 328. Network 328 may be the Internet, a local area network (LAN), a wide area network (WAN), a wireless network (WiFi), a local intranet, or any combination of these. In some embodiments, a load balancer 314 distributes the workload among the instances, so multiple requests issued by a single client 310 need not all go to the same instance. In some embodiments, database client 310 uses information in a global configuration store 312 to identify an appropriate instance for a request. The client uses information from the global configuration store 312 to find the set of blobmasters 204 and bitpushers 210 that are available, and where to contact them. A blobmaster 204 uses a global configuration store 312 to identify the set of peers for all of the replication processes. A bitpusher 210 uses information in a global configuration store 312 to track which stores it is responsible for. In some embodiments, user application 308 runs on the user system 304 without a web browser 306. Exemplary user applications are an email application and an online video application.

In some embodiments, each instance has a blobmaster 204, which is a program that acts as an external interface to the metadata table 206. For example, an external user application 308 can request metadata corresponding to a specified blob using client 310. In some embodiments, every instance 102 has metadata in its metadata table 206 corresponding to every blob stored anywhere in the distributed storage system 200. In other embodiments, the instances come in two varieties: those with global metadata (for every blob in the distributed storage system 200) and those with only local metadata (only for blobs that are stored at the instance). In particular, blobs typically reside at only a small subset of the instances. The metadata table 206 includes information relevant to each of the blobs, such as which instances have copies of a blob, who has access to a blob, and what type of data store is used at each instance to store a blob. The exemplary data structures in FIGS. 18A-18E illustrate other metadata that is stored in metadata table 206 in some embodiments.

When a client 310 wants to read a blob of data, the blobmaster 204 provides one or more read tokens to the client 310, which the client 310 provides to a bitpusher 210 in order to gain access to the relevant blob. When a client 310 writes data, the client 310 writes to a bitpusher 210. The bitpusher 210 returns write tokens indicating that data has been stored, which the client 310 then provides to the blobmaster 204, in order to attach that data to a blob. A client 310 communicates with a bitpusher 210 over network 328, which may be the same network used to communicate with the blobmaster 204. In some embodiments, communication between the client 310 and bitpushers 210 is routed according to a load balancer 314. Because of load balancing or other factors, communication with a blobmaster 204 at one instance may be followed by communication with a bitpusher 210 at a different instance. For example, the first instance may be a global instance with metadata for all of the blobs, but may not have a copy of the desired blob. The metadata for the blobs identifies which instances have copies of the desired blob, so the subsequent communication with a bitpusher 210 to read or write is at a different instance.
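A toy sketch of this token handshake; token generation, validation, and all class names here are illustrative stand-ins for the real protocol (token checking is elided).

```python
import secrets

class Blobmaster:
    def __init__(self):
        self.metadata = {}                       # blob_id -> chunk refs

    def start_read(self, blob_id):
        """Hand the client metadata plus a read token for the bitpusher."""
        return self.metadata[blob_id], secrets.token_hex(8)

    def finalize_write(self, blob_id, write_tokens):
        """Attach already-stored chunks (proven by tokens) to the blob."""
        self.metadata[blob_id] = write_tokens

class Bitpusher:
    def __init__(self):
        self.chunks = {}

    def write(self, data):
        """Store a chunk and return a token proving it was stored."""
        token = secrets.token_hex(8)
        self.chunks[token] = data
        return token

    def read(self, read_token, chunk_refs):
        # Validation of read_token is elided in this sketch.
        return [self.chunks[ref] for ref in chunk_refs]

bm, bp = Blobmaster(), Bitpusher()
# Write path: client -> bitpusher (data), then tokens -> blobmaster.
tokens = [bp.write(b"chunk-0"), bp.write(b"chunk-1")]
bm.finalize_write("blob-7", tokens)
# Read path: blobmaster hands out metadata + read token; bitpusher serves.
refs, rt = bm.start_read("blob-7")
print(b"".join(bp.read(rt, refs)))
```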

A bitpusher 210 copies data to and from data stores. In some embodiments, the read and write operations comprise entire blobs. In other embodiments, each blob comprises one or more chunks, and the read and write operations performed by a bitpusher are solely on chunks. In some of these embodiments, a bitpusher deals only with chunks, and has no knowledge of blobs. In some embodiments, a bitpusher has no knowledge of the contents of the data that is read or written, and does not attempt to interpret the contents. Embodiments of a bitpusher 210 support one or more types of data store. In some embodiments, a bitpusher supports a plurality of data store types, including inline data stores 212, BigTable stores 214, file server stores 216, and tape stores 218. Some embodiments support additional other stores 220, or are designed to accommodate other types of data stores as they become available or technologically feasible.

Inline stores 212 actually use storage space 208 in the metadata store 206. Inline stores provide faster access to the data, but have limited capacity, so inline stores are generally for relatively “small” blobs. In some embodiments, inline stores are limited to blobs that are stored as a single chunk. In some embodiments, “small” means blobs that are less than 32 kilobytes. In some embodiments, “small” means blobs that are less than 1 megabyte. As storage technology facilitates greater storage capacity, even blobs that are currently considered large may be “relatively small” compared to other blobs.

BigTable stores 214 store data in BigTables located on one or more BigTable database servers 316. BigTables are described in several publicly available publications, including “Bigtable: A Distributed Storage System for Structured Data,” Fay Chang et al., OSDI 2006, which is incorporated herein by reference in its entirety. In some embodiments, the BigTable stores save data on a large array of servers 316.

File stores 216 store data on one or more file servers 318. In some embodiments, the file servers use file systems provided by computer operating systems, such as UNIX. In other embodiments, the file servers 318 implement a proprietary file system, such as the Google File System (GFS). GFS is described in multiple publicly available publications, including “The Google File System,” Sanjay Ghemawat et al., SOSP'03, Oct. 19-22, 2003, which is incorporated herein by reference in its entirety. In other embodiments, the file servers 318 implement NFS (Network File System) or other publicly available file systems not implemented by a computer operating system. In some embodiments, the file system is distributed across many individual servers 318 to reduce risk of loss or unavailability of any individual computer.

Tape stores 218 store data on physical tapes 320. Unlike a tape backup, the tapes here are another form of storage. This is described in greater detail in co-pending U.S. Provisional Patent Application Ser. No. 61/302,909, “Method and System for Providing Efficient Access to a Tape Storage System,” filed Feb. 9, 2010, which is incorporated herein by reference in its entirety. In some embodiments, a Tape Master application 222 assists in reading and writing from tape. In some embodiments, there are two types of tape: those that are physically loaded in a tape device, so that the tapes can be robotically loaded; and those tapes that are physically located in a vault or other offline location, and require human action to mount the tapes on a tape device. In some instances, the tapes in the latter category are referred to as deep storage or archived. In some embodiments, a large read/write buffer is used to manage reading and writing data to tape. In some embodiments, this buffer is managed by the tape master application 222. In some embodiments there are separate read buffers and write buffers. In some embodiments, a client 310 cannot directly read or write to a copy of data that is stored on tape. In these embodiments, a client must read a copy of the data from an alternative data source, even if the data must be transmitted over a greater distance.

In some embodiments, there are additional other stores 220 that store data in other formats or using other devices or technology. In some embodiments, bitpushers 210 are designed to accommodate additional storage technologies as they become available.

Each of the data store types has specific characteristics that make them useful for certain purposes. For example, inline stores provide fast access, but use up more expensive limited space. As another example, tape storage is very inexpensive, and provides secure long-term storage, but a client cannot directly read or write to tape. In some embodiments, data is automatically stored in specific data store types based on matching the characteristics of the data to the characteristics of the data stores. In some embodiments, users 302 who create files may specify the type of data store to use. In other embodiments, the type of data store to use is determined by the user application 308 that creates the blobs of data. In some embodiments, a combination of the above selection criteria is used. In some embodiments, each blob is assigned to a storage policy 326, and the storage policy specifies storage properties. A blob policy 326 may specify the number of copies of the blob to save, in what types of data stores the blob should be saved, locations where the copies should be saved, etc. For example, a policy may specify that there should be two copies on disk (BigTable stores or file stores), one copy on tape, and all three copies at distinct metro locations. In some embodiments, blob policies 326 are stored as part of the global configuration and applications 202.
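The example policy in the preceding paragraph might be represented and checked roughly as follows; the field names and the placement-checking helper are assumptions.

```python
from dataclasses import dataclass

@dataclass
class BlobPolicy:
    """Illustrative storage policy, mirroring the example in the text:
    two copies on disk, one on tape, all at distinct metro locations."""
    num_copies: int
    store_types: list      # allowed data store type for each copy
    distinct_metros: bool

example_policy = BlobPolicy(
    num_copies=3,
    store_types=["bigtable_or_file", "bigtable_or_file", "tape"],
    distinct_metros=True,
)

def satisfies(policy, replicas):
    """Check a replica placement against the policy; `replicas` is a
    list of (metro, store_type) pairs (an assumed representation)."""
    if len(replicas) < policy.num_copies:
        return False
    if policy.distinct_metros and len({m for m, _ in replicas}) < policy.num_copies:
        return False
    return sorted(t for _, t in replicas) == sorted(policy.store_types)

print(satisfies(example_policy,
                [("atlanta", "bigtable_or_file"),
                 ("seattle", "bigtable_or_file"),
                 ("boston", "tape")]))   # True
```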

In some embodiments, each instance 102 has a quorum clock server 228, which comprises one or more servers with internal clocks. The order of events, including metadata deltas 608, is important, so maintenance of a consistent time clock is important. A quorum clock server regularly polls a plurality of independent clocks, and determines if they are reasonably consistent. If the clocks become inconsistent and it is unclear how to resolve the inconsistency, human intervention may be required. The resolution of an inconsistency may depend on the number of clocks used for the quorum and the nature of the inconsistency. For example, if there are five clocks, and only one is inconsistent with the other four, then the consensus of the four is almost certainly right. However, if each of the five clocks has a time that differs significantly from the others, there would be no clear resolution.
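The outlier case described above can be expressed as a simple majority-agreement check; the tolerance value and the averaging step are assumptions.

```python
def quorum_consensus(readings, tolerance_us=1000):
    """Given one reading per clock, find a majority that agree to within
    `tolerance_us`; return its average, or None if no clear resolution."""
    for candidate in readings:
        agreeing = [r for r in readings if abs(r - candidate) <= tolerance_us]
        if len(agreeing) > len(readings) // 2:
            return sum(agreeing) // len(agreeing)
    return None

# One outlier among five: the consensus of the other four wins.
print(quorum_consensus([1000, 1001, 999, 1002, 50_000]))      # ~1000
# Five mutually inconsistent clocks: no clear resolution, human needed.
print(quorum_consensus([0, 10_000, 20_000, 30_000, 40_000]))  # None
```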

In some embodiments, each instance has a replication module 224, which identifies blobs or chunks that will be replicated to other instances. In some embodiments, the replication module 224 may use one or more queues 226-1, 226-2, . . . . Items to be replicated are placed in a queue 226, and the items are replicated when resources are available. In some embodiments, items in a replication queue 226 have assigned priorities, and the highest priority items are replicated as bandwidth becomes available. There are multiple ways that items can be added to a replication queue 226. In some embodiments, items are added to replication queues 226 when blob or chunk data is created or modified. For example, if an end user 302 modifies a blob at instance 1, then the modification needs to be transmitted to all other instances that have copies of the blob. In embodiments that have priorities in the replication queues 226, replication items based on blob content changes have a relatively high priority. In some embodiments, items are added to the replication queues 226 based on a current user request for a blob that is located at a distant instance. For example, if a user in California requests a blob that exists only at an instance in India, an item may be inserted into a replication queue 226 to copy the blob from the instance in India to a local instance in California. That is, since the data has to be copied from the distant location anyway, it may be useful to save the data at a local instance. These dynamic replication requests receive the highest priority because they are responding to current user requests. The dynamic replication process is described in more detail in co-pending U.S. Provisional Patent Application Ser. No. 61/302,896, “Method and System for Dynamically Replicating Data Within a Distributed Storage System,” filed Feb. 9, 2010, incorporated herein by reference in its entirety.

In some embodiments, there is a background replication process that creates and deletes copies of blobs based on blob policies 326 and blob access data provided by a statistics server 324. The blob policies specify how many copies of a blob are desired, where the copies should reside, and in what types of data stores the data should be saved. In some embodiments, a policy may specify additional properties, such as the number of generations of a blob to save, or time frames for saving different numbers of copies. E.g., save three copies for the first 30 days after creation, then two copies thereafter. Using blob policies 326, together with statistical information provided by the statistics server 324, a location assignment daemon 322 determines where to create new copies of a blob and what copies may be deleted. When new copies are to be created, records are inserted into a replication queue 226 with the lowest priority. The use of blob policies 326 and the operation of a location assignment daemon 322 are described in more detail in co-pending U.S. Provisional Patent Application Ser. No. 61/302,936, “System and Method for Managing Replicas of Objects in a Distributed Storage System,” filed Feb. 9, 2010, which is incorporated herein by reference in its entirety.

FIG. 4 is a block diagram illustrating an Instance Server 400 used for operations identified in FIGS. 2 and 3 in accordance with some embodiments of the present invention. An Instance Server 400 typically includes one or more processing units (CPU's) 402 for executing modules, programs and/or instructions stored in memory 414 and thereby performing processing operations; one or more network or other communications interfaces 404; memory 414; and one or more communication buses 412 for interconnecting these components. In some embodiments, an Instance Server 400 includes a user interface 406 comprising a display device 408 and one or more input devices 410. In some embodiments, memory 414 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM or other random access solid state memory devices. In some embodiments, memory 414 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. In some embodiments, memory 414 includes one or more storage devices remotely located from the CPU(s) 402. Memory 414, or alternately the non-volatile memory device(s) within memory 414, comprises a computer readable storage medium. In some embodiments, memory 414 or the computer readable storage medium of memory 414 stores the following programs, modules and data structures, or a subset thereof:

-   -   an operating system 416 that includes procedures for handling various basic system services and for performing hardware dependent tasks;
    -   a communications module 418 that is used for connecting an Instance Server 400 to other Instance Servers or computers via the one or more communication network interfaces 404 (wired or wireless) and one or more communication networks 328, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
    -   one or more server applications 420, such as a blobmaster 204 that provides an external interface to the blob metadata; a bitpusher 210 that provides access to read and write data from data stores; a replication module 224 that copies data from one instance to another; a quorum clock server 228 that provides a stable clock; a location assignment daemon 322 that determines where copies of a blob should be located; and other server functionality as illustrated in FIGS. 2 and 3. As illustrated, two or more server applications 422 and 424 may execute on the same physical computer;
    -   one or more database servers 426 that provide storage and access to one or more databases 428. The databases 428 may provide storage for metadata 206, replication queues 226, blob policies 326, global configuration 312, the statistics used by statistics server 324, as well as ancillary databases used by any of the other functionality. Each database 428 has one or more tables with data records 430. In some embodiments, some databases include aggregate tables 432, such as the statistics used by statistics server 324; and
    -   one or more file servers 434 that provide access to read and write files, such as file #1 (436) and file #2 (438). File server functionality may be provided directly by an operating system (e.g., UNIX or Linux), or by a software application, such as the Google File System (GFS).

Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 414 may store a subset of the modules and data structures identified above. Furthermore, memory 414 may store additional modules or data structures not described above.

Although FIG. 4 shows an instance server used for performing various operations or storing data as illustrated in FIGS. 2 and 3, FIG. 4 is intended more as a functional description of the various features which may be present in a set of one or more computers rather than as a structural schematic of the embodiments described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated. For example, some items shown separately in FIG. 4 could be implemented on individual computer systems, and single items could be implemented by one or more computer systems. The actual number of computers used to implement each of the operations, databases, or file storage systems, and how features are allocated among them, will vary from one implementation to another, and may depend in part on the amount of data at each instance, the amount of data traffic that an instance must handle during peak usage periods, as well as the amount of data traffic that an instance must handle during average usage periods.

To provide faster responses to clients and to provide fault tolerance, each program or process that runs at an instance is generally distributed among multiple computers. The number of instance servers 400 assigned to each of the programs or processes can vary, and depends on the workload. FIG. 5 provides exemplary information about a typical number of instance servers 400 that are assigned to each of the functions. In some embodiments, each instance has about 10 instance servers performing (502) as blobmasters. In some embodiments, each instance has about 100 instance servers performing (504) as bitpushers. In some embodiments, each instance has about 50 instance servers performing (506) as BigTable servers. In some embodiments, each instance has about 1000 instance servers performing (508) as file system servers. File system servers store data for file system stores 216 as well as the underlying storage medium for BigTable stores 214. In some embodiments, each instance has about 10 instance servers performing (510) as tape servers. In some embodiments, each instance has about 5 instance servers performing (512) as tape masters. In some embodiments, each instance has about 10 instance servers performing (514) replication management, which includes both dynamic and background replication. In some embodiments, each instance has about 5 instance servers performing (516) as quorum clock servers.

FIG. 6 illustrates the storage of metadata data items 600 according to some embodiments. Each data item 600 has a unique row identifier 602. Each data item 600 is a row 604 that has a base value 606 and zero or more deltas 608-1, 608-2, . . . , 608-L. When there are no deltas, the value of the data item 600 is the base value 606. When there are deltas, the “value” of the data item 600 is computed by starting with the base value 606 and applying the deltas 608-1, etc., in order, to the base value. A row thus has a single value, representing a single data item or entry. Although in some embodiments the deltas store the entire new value, in other embodiments the deltas store as little data as possible to identify the change. For example, metadata for a blob includes specifying which instances have the blob as well as who has access to the blob. If the blob is copied to an additional instance, the metadata delta only needs to specify that the blob is available at the additional instance. The delta need not specify where the blob is already located. The reading of metadata data items 600 is described in more detail with respect to FIG. 13. As the number of deltas increases, the time to read data increases, so there is also a compaction process 1200, described below with respect to FIGS. 8 and 12A-12B. The compaction process merges the deltas 608-1, etc. into the base value 606 to create a new base value that incorporates the changes in the deltas.
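In code, reading a row's current value amounts to folding the deltas into the base value in sequence order, roughly as sketched here (deltas modeled as dicts of changed fields, which is an assumption).

```python
def current_value(base, deltas):
    """Compute a row's value by applying its deltas, in sequence-identifier
    order, to the base value. Values here are dicts of metadata fields."""
    value = dict(base)
    for delta in sorted(deltas, key=lambda d: d["seq"]):
        value.update(delta["changes"])    # each delta stores only changes
    return value

base = {"instances": ["atlanta"], "owner": "alice"}
deltas = [
    {"seq": 2, "changes": {"owner": "bob"}},
    {"seq": 1, "changes": {"instances": ["atlanta", "seattle"]}},
]
print(current_value(base, deltas))
# {'instances': ['atlanta', 'seattle'], 'owner': 'bob'}
```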

Although the storage shown in FIG. 6 relates to metadata for blobs, the same process is applicable to other non-relational databases, such as columnar databases, in which the data changes in specific ways. For example, an access control list may be implemented as a multi-byte integer in which each bit position represents an item, location, or person. Changing one piece of access information does not modify the other bits, so a delta to encode the change requires little space. In alternative embodiments where the data is less structured, deltas may be encoded as instructions for how to make changes to a stream of binary data. Some such encodings are described in publication RFC 3284, “The VCDIFF Generic Differencing and Compression Data Format,” The Internet Society, 2002. One of ordinary skill in the art would thus recognize that the same technique applied here for metadata is equally applicable to certain other types of structured data.

FIG. 7 illustrates an exemplary data structure to hold a delta. Each delta applies to a unique row, so the delta includes the row identifier 702 of the row to which it applies. In order to guarantee data consistency at multiple instances, the deltas must be applied in a well-defined order to the base value. The sequence identifier 704 is globally unique, and specifies the order in which the deltas are applied. In some embodiments, the sequence identifier comprises a timestamp 706 and a tie breaker value 708 that is uniquely assigned to each instance where deltas are created. In some embodiments, the timestamp is the number of microseconds past a well-defined point in time. In some embodiments, the tie breaker is computed as a function of the physical machine running the blobmaster as well as a process id. In some embodiments, the tie breaker includes an instance identifier, either alone, or in conjunction with other characteristics at the instance. In some embodiments, the tie breaker 708 is stored as a tie breaker value 132. By combining the timestamp 706 and a tie breaker 708, the sequence identifier is both globally unique and reflects, at least approximately, the order in which the deltas were created. In certain circumstances, clocks at different instances may be slightly different, so the order defined by the sequence identifiers may not correspond to the “actual” order of events. However, in some embodiments, the “order,” by definition, is the order created by the sequence identifiers. This is the order in which the changes will be applied at all instances.
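A sketch of such a sequence identifier: a (timestamp, tie breaker) pair compared lexicographically. The particular tie-breaker construction shown (instance identifier mixed with process ID) is one of the options the text mentions, encoded in an assumed way.

```python
import os, time
from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class SequenceId:
    """Globally unique, totally ordered: timestamp first, then tie
    breaker, compared field by field."""
    timestamp_us: int
    tie_breaker: int

def make_tie_breaker(instance_id):
    """One assumed construction: mix the instance identifier with the
    process ID so no two delta-creating tasks share a tie breaker."""
    return (instance_id << 16) | (os.getpid() & 0xFFFF)

tb = make_tie_breaker(instance_id=7)
s1 = SequenceId(int(time.time() * 1e6), tb)
s2 = SequenceId(s1.timestamp_us, tb + 1)
assert s1 < s2    # equal timestamps are ordered by the tie breaker
```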

A change to metadata at one instance is replicated to other instances. The actual change to the base value 712 may be stored in various formats. In some embodiments, data structures similar to those in FIGS. 18A-18E are used to store the changes, but the structures are modified so that most of the fields are optional. Only the actual changes are filled in, so the space required to store or transmit the delta is small. In other embodiments, the changes are stored as key/value pairs, where the key uniquely identifies the data element changed, and the value is the new value for the data element.

In some embodiments where the data items are metadata for blobs, deltas may include information about forwarding. Because blobs may be dynamically replicated between instances at any time, and the metadata may be modified at any time as well, there are times that a new copy of a blob does not initially have all of the associated metadata. In these cases, the source of the new copy maintains a “forwarding address,” and transmits deltas to the instance that has the new copy of the blob for a certain period of time (e.g., for a certain range of sequence identifiers).

FIG. 8 illustrates a compaction process that reduces the number of deltas. If compaction were not performed, the number of deltas would grow without limit, taking up storage space and slowing down performance for reading data. The idea is to apply the deltas to the base value, effectively merging the base value and the deltas into a single new base value. However, because of the existence of multiple copies of the same data at distinct instances, there are some constraints imposed on which deltas may be merged with the base value. In some embodiments, a compaction horizon is selected that specifies the upper limit on which deltas will be merged. In some embodiments, the compaction horizon is selected for a group of data items 600, although a compaction horizon could be selected for an individual data item 600.

Before the compaction process begins, each data item 600 is a row 604A with an original base value 606A and a set of zero or more deltas 608-1, etc. For a data item 600 with zero deltas, there is nothing to compact. The data item 600 illustrated in FIG. 8 initially has five deltas 608-1 to 608-5. In the embodiment shown, the compaction horizon 610 is somewhere between the sequence identifier of delta 4 (608-4) and the sequence identifier of delta 5 (608-5). More specifically, FIG. 8 depicts an example in which the sequence identifier of delta 4 is less than or equal to the compaction horizon 610, and the compaction horizon is strictly less than the sequence identifier of delta 5. Delta 1 (608-1) through delta 4 (608-4) are applied to the base value 606A in sequence, to produce a new base value 606B that has been merged with the deltas. Delta 1 to delta 4 are then deleted from the original row 604A, leaving the new row 604B with the merged base value 606B and a set with the single delta 608-5. If the compaction horizon had included delta 608-5, the new row 604B would not have included any deltas.
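The FIG. 8 scenario maps onto a small compaction routine: merge every delta at or below the horizon, keep the rest. The dict-based delta model is the same assumption as in the earlier sketches.

```python
def compact(base, deltas, horizon):
    """Merge into the base value every delta whose sequence identifier is
    at or below the compaction horizon; keep the rest unchanged."""
    merged, kept = dict(base), []
    for delta in sorted(deltas, key=lambda d: d["seq"]):
        if delta["seq"] <= horizon:
            merged.update(delta["changes"])
        else:
            kept.append(delta)
    return merged, kept

base = {"owner": "alice"}
deltas = [{"seq": i, "changes": {"rev": i}} for i in range(1, 6)]
new_base, remaining = compact(base, deltas, horizon=4)
print(new_base)     # {'owner': 'alice', 'rev': 4} -- deltas 1-4 merged
print(remaining)    # [{'seq': 5, ...}]            -- delta 5 survives
```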

The compaction process is also described below with respect to FIGS. 12A-12B and 17. In particular, the discussion of FIGS. 12A and 12B includes examples of why the compaction horizon may not include all of the deltas at an instance (as illustrated in FIG. 8). Although shown here in the context of a single data item 600, compaction is generally a batch process because of the very large quantities of data and the fact that deltas are generally transmitted between instances in batches.

FIG. 9 illustrates an exemplary process for replicating metadata from one instance to another instance. Although the simple illustration in FIG. 9 shows only a single metadata data item 600 and a single delta 608, the method is generally applied to much larger batches, as illustrated below with respect to FIGS. 15A-15B.

The replication process described here applies to existing copies of data at multiple instances. When metadata at one instance changes, the changes must be replicated to all other instances that have metadata for the same underlying data. Co-pending U.S. patent application Ser. No. 12/703,167, “Method and System for Efficiently Replicating Data in Non-Relational Databases,” filed Feb. 9, 2010, describes a different replication process, in which a new copy of data is replicated to a new instance. In the latter case, a complete copy of the metadata must be sent to the new instance, and any recent changes to the metadata must get to the new instance as well.

The replication process effectively begins when a change to metadata occurs (902) at one instance that will require replication to other instances. When the change (also known as a mutation) occurs, a delta is created (904) to specify the change. An exemplary format is illustrated in FIG. 7 and described above. In principle, the delta could be replicated immediately, but deltas are generally transmitted in batches, as more fully illustrated in the exemplary processes shown in FIGS. 15A-15B.

At some point, the replication process is initiated (906). In some embodiments, replication can be initiated manually. In other embodiments, replication is a scheduled background process (e.g., triggered at certain time intervals, at certain times of the day, or when the workload is low). In some embodiments, replication runs continuously in the background. In some embodiments, every instance has metadata for each of the blobs, regardless of whether the blobs are physically stored at the instance. In other embodiments, there are a limited number of global instances that maintain metadata for all of the blobs, and a greater number of local instances that maintain metadata only for the blobs stored at the instance. For replication targets that are local instances, the replication process determines (908) whether the metadata item 600 resides at the replication target. In some embodiments, the replication process determines all instances that require the changed metadata.

For the target instances that have the metadata data item 600, the replication process determines (910) whether the target instance has received delta 608. In some embodiments, this determination uses an egress map 134, as shown in FIGS. 14A and 14B and described in more detail with respect to FIGS. 15A-15B. Based on the deltas to send, and which deltas have already been received at each target instance, the replication process builds (912) a transmission matrix that specifies a group of deltas to transmit to each target instance. In some embodiments, the transmission matrix is a two-dimensional shape (e.g., a rectangle), as illustrated in FIGS. 15A-15B. In other embodiments, the transmission matrix is a list or one-dimensional array. The replication process then transmits (914) the selected deltas to each target instance.

At a target instance, the deltas are received (916) and each delta is inserted (918) into the set of deltas for the corresponding metadata data item 600. In some embodiments, the replication process updates (920) an ingress map 136 to indicate that the delta (or batch of deltas) has been incorporated into the metadata at the target instance. The replication process at the target instance also sends an acknowledgement back to the sender to indicate that the deltas have been received and incorporated.

The original sender of the deltas receives (924) the acknowledgement from the target instance, and updates (926) an egress map 134. By updating the egress map, the same deltas will not be transmitted to the same target again in the future. The updated egress map also enables compaction of deltas, as explained in more detail with respect to FIGS. 12A and 12B.

FIG. 10 is a block diagram illustrating a client computer system 304 that is used by a user 302 to access data stored at an instance 102 in accordance with some embodiments of the present invention. A client computer system 304 typically includes one or more processing units (CPU's) 1002 for executing modules, programs and/or instructions stored in memory 1014 and thereby performing processing operations; one or more network or other communications interfaces 1004; memory 1014; and one or more communication buses 1012 for interconnecting these components. The communication buses 1012 may include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. A client computer system 304 includes a user interface 1006 comprising a display device 1008 and one or more input devices 1010 (e.g., a keyboard and a mouse or other pointing device). In some embodiments, memory 1014 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM or other random access solid state memory devices. In some embodiments, memory 1014 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. Optionally, memory 1014 includes one or more storage devices remotely located from the CPU(s) 1002. Memory 1014, or alternately the non-volatile memory device(s) within memory 1014, comprises a computer readable storage medium. In some embodiments, memory 1014 or the computer readable storage medium of memory 1014 stores the following programs, modules and data structures, or a subset thereof:

-   -   an operating system 1016 that includes procedures for handling various basic system services and for performing hardware dependent tasks;
    -   a communications module 1018 that is used for connecting the client computer system 304 to other computers via the one or more communication network interfaces 1004 (wired or wireless) and one or more communication networks 328, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
    -   a web browser 306 (or other client application) that enables a user to communicate over a network 328 (such as the Internet) with remote computers. In some embodiments, the web browser 306 uses a JavaScript run-time module 1020 to perform some functions;
    -   one or more user applications 308 that provide specific functionality. For example, user applications 308 may include an email application 308-1 and/or an online video application 308-2; and
    -   one or more database clients, such as email database client 310-1 or video database client 310-2, that provide an API for the data stored at instances 102 to user applications 308.

Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 1014 may store a subset of the modules and data structures identified above. Furthermore, memory 1014 may store additional modules or data structures not described above.

Although FIG. 10 shows a client computer system 304 that may access data stored at an instance 102, FIG. 10 is intended more as a functional description of the various features which may be present in a set of one or more computers rather than as a structural schematic of the embodiments described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated.

FIGS. 11A-11C provide a flowchart of an exemplary process 1100 for replicating (1102) data between a plurality of instances of a distributed database. In one embodiment, the distributed database holds metadata for a distributed storage system. In some embodiments, each instance of the distributed database is stored on one or more server computers, each having memory and one or more processors (1104).

The replication process 1100 identifies (1106) a first instance of the database at a first geographic location and identifies (1108) a second instance of the database at a second geographic location. In some embodiments, the second geographic location is distinct from the first location (1110). In some embodiments, a third instance of the database is identified (1112) at a third geographic location, which is distinct from the first and second geographic locations. In some embodiments, there are four or more instances of the database. In some embodiments, two or more instances of the database reside at the same geographic location. One reason for having multiple instances at the same geographic site is to provide for maintenance zones. In some embodiments, a single data center has multiple maintenance zones, and each such zone comprises an instance in the distributed database system. In some embodiments, when an instance is going to be taken down for maintenance, the data is replicated to one or more other instances beforehand, which may be other instances at the same data center.

For example, there may be single instances of the database in Atlanta, Seattle, and Los Angeles, and two instances of the database in Boston. In some embodiments, there are instances of the database on every continent except Antarctica, and even some instances on islands. The disclosed distributed storage system imposes no limit on the number or location of instances.

To facilitate efficient replication, changes to the distributed database are tracked as deltas (1114). Each delta has a row identifier that identifies the piece of data modified (1116). Each delta also has a sequence identifier that specifies the order in which the deltas are applied to the data (1118). The sequence identifiers are globally unique throughout the distributed storage system, so there is no ambiguity about the order in which the deltas are applied to the data. In some embodiments, the sequence identifier comprises (1120) a timestamp and a unique tie breaker value that is assigned based on hardware and/or software at each instance. In some embodiments, the timestamp specifies the number of microseconds after a designated point of time in the past. In some embodiments, the tie breaker value is computed based on one or more of the following values: an identifier of a physical machine at the instance, such as a unique serial number or a network interface card (NIC) address; an instance identifier; or a process ID of a specific process running at the instance (e.g., a UNIX process ID assigned to the database process). Because the tie breaker is a unique value assigned to each instance, the combination of a timestamp and the tie breaker provides a sequence identifier based on time, but guaranteed to be unique.
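
For illustration only, the composition of such a sequence identifier might be sketched as follows; the helper name, field widths, and epoch are assumptions made for the sketch, not details of the disclosed embodiments:

    import time

    def make_sequence_id(instance_id: int, process_id: int) -> int:
        """Combine a microsecond timestamp with a per-instance tie breaker.

        The timestamp occupies the high-order bits, so comparing two
        sequence identifiers orders them primarily by time; the tie
        breaker makes the identifier unique when two instances generate
        deltas during the same microsecond.
        """
        timestamp_us = int(time.time() * 1_000_000)  # microseconds since the epoch
        tie_breaker = ((instance_id & 0xFFFFF) << 20) | (process_id & 0xFFFFF)
        return (timestamp_us << 40) | tie_breaker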

The time clocks at each instance are not guaranteed to be synchronized to the microsecond, and thus the ordering defined by the sequence identifiers is not guaranteed to match exactly what happened. However, if two changes to the same metadata item 600 occur at about the same time at two distant locations on the globe (e.g., Los Angeles and Paris), the exact order is unimportant. Having a well-defined unique order that will be applied to every instance of the database is the more relevant issue, and this is provided by the sequence identifiers. Moreover, in embodiments that use a timestamp or something similar to create the sequence identifiers, the sequence identifiers are in the right time sequence order virtually all of the time because multiple changes to the same metadata rarely occur at the same time at two distinct instances.

Each delta includes an instance identifier (1122) as well. Each instance is responsible for pushing out its changes (i.e., deltas) to all of the other instances, so each instance must be able to recognize the deltas that it created. In some embodiments, the instance identifier is saved as part of the data structure for each individual delta. In other embodiments, the association between deltas and instances is stored differently. For example, deltas may include a bit flag that indicates which deltas were created at the current instance. In other embodiments, the instance identifier is not stored as a separate data element because it is stored as part of the sequence identifier, or can be readily derived from the sequence identifier.

The replication process 1100 determines (1124) which deltas are to be sent to the second instance using a second egress map 134 at the first instance, where the second egress map specifies which combinations of row identifier and sequence identifier have been acknowledged as received at the second instance. An egress map 134 can be stored in a variety of ways, as illustrated in FIGS. 14A and 14B. FIG. 14B illustrates a map that might be used if the egress map were stored in a typical database. In this example, each row represents a single delta that is to be transmitted to a single destination. The destination instance 1412 specifies to what instance the delta has been (or will be) sent. The row identifier 1414 and sequence identifier 1416 specify the row identifier and sequence identifier of a delta. In some embodiments, the presence of a row in this egress table indicates that the delta has been acknowledged as received at the destination instance. In other embodiments, there is an additional field, such as "acknowledged," which is updated when the deltas are acknowledged. In these embodiments, rows may be inserted into the egress table as soon as deltas are created, or prior to transmission of the deltas to destination instances. In some embodiments, there is a separate egress table for each destination instance, so the rows in each egress table do not need to specify a destination instance.

Although the egress table in FIG. 14B is conceptually simple, it consumes considerable resources, both in time and disk space. In some embodiments, a structure similar to the one shown in FIG. 14A may be used. In the egress table 134 shown in FIG. 14A, each record specifies a two dimensional rectangle of deltas. In one dimension, the start row 1404 and end row 1406 specify the beginning and ending of a range of row identifiers. In a second dimension, the start sequence 1408 and end sequence 1410 specify the beginning and ending of a range of sequence identifiers. Although this two dimensional region could theoretically contain a very large number of deltas, the region is actually sparse, for three reasons. First, within the contiguous range of row identifiers, few of the rows will actually have any changes. Second, very few of the potential sequence identifiers within the range are actually used. For example, an exemplary timestamp used to form sequence identifiers uses microseconds, but changes to metadata do not occur every microsecond. Third, each sequence identifier that is used applies to a single delta, and that single delta applies to a unique row of data.
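
A sketch of such a rectangle record and its membership test, assuming row identifiers and sequence identifiers are represented as ordered integers (the class and field names are illustrative, not taken from the disclosure):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class EgressRect:
        """One egress-map record: a two dimensional rectangle of deltas."""
        start_row: int   # inclusive range of row identifiers
        end_row: int
        start_seq: int   # inclusive range of sequence identifiers
        end_seq: int

        def covers(self, row_id: int, seq_id: int) -> bool:
            """True if the delta (row_id, seq_id) falls inside this rectangle."""
            return (self.start_row <= row_id <= self.end_row
                    and self.start_seq <= seq_id <= self.end_seq)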

In some embodiments that use egress maps similar to the one depicted in FIG. 14A, there is no overlap between distinct rows in the table. In these embodiments, each delta corresponds to a unique record in the egress table for each destination instance. In other embodiments, overlapping rectangles are allowed. Even when the same delta is transmitted to another instance multiple times, it will only be inserted one time, so multiple acknowledgements for the same delta do not indicate an error condition.

As noted above, some embodiments maintain a separate egress table for each destination instance. The usage of egress tables is described in more detail below with respect to FIGS. 15A-15B.

Attention is directed back to the replication process 1100, which continues in FIG. 11B. In some embodiments, the replication process 1100 determines (1126) which deltas are to be sent to the third instance using a third egress map at the first instance, where the third egress map specifies which combinations of row identifier and sequence identifier have been acknowledged as received at the third instance. This process is analogous to the process used to determine which deltas to send to the second instance.

The use of "second" in "second egress map" and "third" in "third egress map" is solely to identify a specific egress map, and does not imply or suggest the existence of a first egress map. The same use of "second" and "third" appears below with respect to transmission matrices as well.

The replication process 1100 builds (1128) a second transmission matrix for the second instance that identifies deltas that have not yet been acknowledged as received at the second instance. In some embodiments, the replication process 1100 selects a range of row identifiers, and manages all deltas that correspond to rows with row identifiers within the specified range, regardless of sequence identifier. The selection without regard to sequence identifier is equivalent to selecting a range of sequence identifiers from 0 (or the lowest value) to the highest sequence identifier currently in use. This is a two dimensional rectangle that contains all possible deltas for the rows contained in the rectangle. Because this large rectangle contains all possible deltas of interest, and the egress map 134 indicates which deltas have already been transmitted to the second instance and acknowledged, the difference (i.e., the set-theoretic difference) identifies the set to send to the second instance. This process is described in more detail with respect to FIGS. 15A-15B below.
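
One way to realize this set-theoretic difference, sketched over explicit sets of (row identifier, sequence identifier) pairs rather than the rectangle encoding, and reusing the illustrative EgressRect sketch above:

    def build_transmission_matrix(candidate_deltas, acked_rects):
        """Return the deltas not yet acknowledged by the destination.

        candidate_deltas -- set of (row_id, seq_id) pairs in the selected
                            row range, up to the highest sequence identifier
        acked_rects      -- EgressRect records from the destination's egress map
        """
        return {(row, seq) for (row, seq) in candidate_deltas
                if not any(r.covers(row, seq) for r in acked_rects)}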

In some embodiments, the transmission matrix is built using information from the egress map about what deltas have been acknowledged as received by the second instance. In this case, it is possible (and sometimes desirable) to re-send deltas that have already been transmitted to the second instance. In some cases resending is useful because there was a failure at some point in the previous attempt (e.g., the transmission did not reach the destination, the destination was down and therefore could not receive the transmission, there was a failure at the destination in the middle of processing the deltas, or an acknowledgement was sent back but never received at the first instance). Even if a previous transmission was fully or partially incorporated into the destination instance, re-sending the deltas does not create a problem because only the missing deltas will be inserted. When the re-sent transmission is complete, an acknowledgement will be sent to the first instance for the entire batch of deltas, potentially including some deltas that were already incorporated into the second instance but not yet acknowledged.

In some embodiments, the replication process builds (1130) a third transmission matrix for the third instance that identifies deltas that have not yet been acknowledged as received at the third instance. This process is analogous to building (1128) the second transmission matrix as described above.

Once transmission matrices have been created for multiple instances, the transmission matrices and their destinations can be modified in several ways to better utilize resources. In this context, network bandwidth is one important resource that is both limited and costly. One simple example is illustrated in FIG. 16. In this example, suppose the transmission matrices for the second and third instances are the same, and suppose the deltas corresponding to these transmission matrices use one unit of bandwidth. The total cost would be $5+$7=$12 if the deltas were transmitted directly to the second and third instances using network links 104-8 and 104-7. However, if the deltas were transmitted to Instance 2 using network link 104-8, and then on to Instance 3 using network link 104-9, the total cost would be only $5+$4=$9. In general, other factors would be considered as well, including the availability of network bandwidth, the reliability of the network links, processing power at each of the instances, etc.
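
The arithmetic of this example reduces to a comparison of two link-cost sums; a minimal sketch, with costs expressed in dollars per unit of bandwidth as depicted in FIG. 16:

    def cheaper_plan(cost_1_to_2, cost_1_to_3, cost_2_to_3, units=1):
        """Compare direct fan-out against relaying through instance 2."""
        direct = (cost_1_to_2 + cost_1_to_3) * units   # links 104-8 and 104-7
        relay = (cost_1_to_2 + cost_2_to_3) * units    # link 104-8, then 104-9
        if relay < direct:
            return "relay via instance 2", relay
        return "send directly", direct

    print(cheaper_plan(5, 7, 4))   # ('relay via instance 2', 9) versus direct at 12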

The previous example was based on the assumption that the same transmission matrices applied to both the second and third instances. Although this is commonly true, they may be different. However, even when they are different, the difference is often small, so modifying the transmission matrices may produce new ones that are more efficient, as explained in more detail with respect to FIGS. 15A-15B below.

In some embodiments, the replication process 1100 modifies (1132) the transmission matrices for the second and third instances to form one or more revised transmission matrices. The deltas identified in each revised transmission matrix are transmitted (1132) to a respective location to update the instance at the respective location, and deltas identified in at least one of the revised transmission matrices are transmitted to the second location for subsequent transmission from the second location to the third location. In some embodiments, the modification of the transmission matrices is based on an analysis of the total cost for transmitting the deltas to the second and third geographic locations (1134), and includes assigning (1134) a cost for transmissions between each pair of geographic locations. In some embodiments, the modification to the transmission matrices includes determining (1136) bandwidth availability between the geographic locations of the instances. In some circumstances, the transmission matrices for the second and third instances are the same. Sometimes when this occurs, there is only one revised transmission matrix, which is the same as the original transmission matrices, and deltas identified in the revised transmission matrix are transmitted to the second geographic location for subsequent transmission to the third geographic location (1138). However, having two (or more) transmission matrices that are the same does not necessarily lead to revising the transmission matrices, or to sending the deltas to one instance for subsequent forwarding to another instance. For example, if the cost of network link 104-9 in FIG. 16 were $10/unit of bandwidth instead of $4/unit as depicted in the figure, then it would be more cost effective to transmit the deltas to instance 2 and instance 3 directly.

The replication process 1100 transmits (1140) deltas identified in the second transmission matrix to the second instance. If the process does not fail, the first instance ultimately receives (1142) acknowledgement that the transmitted deltas have been incorporated in the second instance. The replication process updates (1146) the second egress map to indicate the acknowledged deltas. In some embodiments, the first instance receives (1144) acknowledgement that deltas transmitted to the third instance, either directly or indirectly via the second instance, have been incorporated into the third instance. When the first instance receives (1144) acknowledgement regarding deltas transmitted to the third instance, the replication process updates (1148) the third egress map to indicate the acknowledged deltas.

FIGS. 12A and 12B illustrate an exemplary compaction process 1200 that compacts (1202) data for rows in a distributed database with a plurality of instances. Each instance of the database stores (1204) data on one or more server computers, and each server computer has (1204) memory and one or more processors. Each row in the distributed database has (1206) a base value and a set of zero or more deltas, as illustrated in FIG. 6. Each delta specifies (1208) a change to the base value, includes a sequence identifier that specifies (1208) the order in which the deltas are to be applied to the base value, and specifies (1208) the instance where the delta was created. In some embodiments, each sequence identifier comprises (1210) a timestamp and a unique tie breaker value that is assigned based on hardware and/or software at each instance.

The compaction process 1200 identifies (1212) a first instance of the distributed database. Compaction will occur at this instance. In some embodiments, the compaction process 1200 identifies (1214) a plurality of other instances of the distributed database. In some embodiments, one or more of the other instances are at other geographic locations distinct from the geographic location of the first instance. The compaction process 1200 selects (1216) a set of one or more row identifiers that identify rows of data in the distributed database. In some embodiments, the set of rows comprises a contiguous range of rows.

The compaction process 1200 selects (1218) a compaction horizon for the selected set of one or more row identifiers. In some embodiments, the compaction horizon is a sequence identifier of a delta for a row corresponding to a row identifier in the selected set. The compaction horizon has the same data format as the sequence identifiers, so that sequence identifiers can be compared to the compaction horizon. That is, each sequence identifier is either less than, equal to, or greater than the compaction horizon. The compaction horizon need not be equal to any of the sequence identifiers that are assigned to deltas.

In some embodiments, the compaction horizon must satisfy one or more criteria. In some embodiments, deltas at the first instance with corresponding sequence identifiers less than or equal to the compaction horizon must have been transmitted to all other appropriate instances (1220): specifically, all deltas that (i) were created at the first instance, (ii) are for rows corresponding to row identifiers in the selected set of one or more row identifiers, and (iii) have sequence identifiers less than or equal to the compaction horizon, have been transmitted to and acknowledged by all of the other instances that maintain data for the corresponding row identifiers (1220). In some embodiments, the transmission of deltas to other instances is verified using one or more egress maps (which are described above with respect to the replication process 1100). In some embodiments, the first instance must have received all deltas from other instances that are relevant to the selected set of rows and have sequence identifiers less than or equal to the compaction horizon (1222): specifically, all deltas that (i) were created at instances in the plurality of other instances, (ii) are for rows corresponding to row identifiers in the selected set of one or more row identifiers, and (iii) have sequence identifiers less than or equal to the compaction horizon, have been received at the first instance (1222). In some embodiments, receipt of deltas from other instances is verified using one or more ingress maps (which are described in more detail below with respect to FIGS. 14C and 14D). The selection of a compaction horizon is also described in more detail below with respect to FIG. 17.

After the compaction horizon is selected, the compaction process applies (1224), in sequence, all deltas for the selected set of one or more row identifiers that have sequence identifiers less than or equal to the compaction horizon to the base value for the corresponding row identifier. This is shown graphically in FIG. 8, where data item 600 has original base value 606A and a set of deltas 608-1 to 608-5. In the example of FIG. 8, the sequence identifiers for the first four deltas are less than or equal to the compaction horizon, but the fifth delta 608-5 has a sequence identifier greater than the compaction horizon. The compaction process applies (or merges) the deltas with the original base value 606A to create a new base value 606B. The compaction process also deletes (1226) the deltas that have been applied to the base value. In the example in FIG. 8, the first four deltas have been deleted, leaving only the fifth delta 608-5 (which was greater than the compaction horizon).
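
A compact sketch of this merge-and-delete step; the row object and the apply_delta function are assumptions of the sketch, since how a delta is applied depends on the data being stored:

    def compact_row(row, horizon, apply_delta):
        """Fold deltas at or below the compaction horizon into the base value.

        row.deltas is assumed to be a list of (seq_id, delta) pairs.
        """
        to_merge = sorted(d for d in row.deltas if d[0] <= horizon)
        row.deltas = [d for d in row.deltas if d[0] > horizon]
        for _seq_id, delta in to_merge:            # apply in sequence order
            row.base = apply_delta(row.base, delta)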

FIG. 13 illustrates an exemplary process 1300 for reading (1302) a data item from a distributed database with a plurality of data rows. Each row comprises (1304) a base value and zero or more deltas that specify modifications to the base value. This is illustrated in FIG. 6. The reading process is performed (1306) by one or more server computers, each having memory and one or more processors.

The reading process 1300 receives (1308) a request from a client for a specified data item 600. The request includes (1308) a row identifier that identifies the data item 600. The process 1300 reads (1310) the base value 606 for the specified data item from the distributed database, and stores (1310) the base value in memory. The process 1300 also reads (1312) the deltas 608-1 to 608-L for the specified data item, if any, from the distributed database. Each delta includes (1314) a sequence identifier 704 that specifies the order in which the deltas are to be applied to the base value. Typically there are no deltas at all for an individual data item 600, so the value for the data item is just the base value 606.

The process 1300 applies (1316) the deltas 608 to the base value stored in memory, in sequence, resulting in a current base value stored in memory. Unlike compaction, the reading process does not change the base value 606 stored in the database. The current base value in memory is distinct from the base value 606 in the database. When there are no deltas for a data item, there is no work to perform in applying the deltas. As used herein, the operation of "applying deltas to the base value" occurs even when there are no deltas. The process returns (1318) the current base value stored in memory to the client.
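
The read path can be sketched the same way as compaction, with the important difference that the merge happens on an in-memory value and the stored base value is left untouched (apply_delta is assumed to return a new value rather than mutate its argument):

    def read_row(row, apply_delta):
        """Return the current value of a row without modifying stored data."""
        value = row.base                            # base value read into memory
        for _seq_id, delta in sorted(row.deltas):   # apply deltas in order
            value = apply_delta(value, delta)
        return value                                # stored base value unchanged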

Because the read process 1300 reads and applies all of the deltas, the reading time and the disk space used by the deltas will increase over time. Therefore, some embodiments utilize a compaction process 1200 as described above, which merges deltas into the corresponding base values, reducing both disk space usage and the time required to read data items.

FIGS. 14C and 14D provide exemplary data structures for ingress maps 136. Ingress maps 136 identify deltas that have been received at an instance from other instances. The ingress map shown in FIG. 14D is a typical map for use in a database. Each record in the ingress map of FIG. 14D represents a single delta. The ingress map includes the source instance 1428, which specifies the original source of the delta. As described above with respect to replication, transmissions may be forwarded from one instance to another, so a delta need not be received from the instance where the delta was created. The ingress map tracks the original instance. Optionally, some embodiments also track the instance that transmitted the delta to the current instance.

The ingress map also includes a row identifier 1430, which specifies the row to which the delta applies, and a sequence identifier 1432, which is globally unique and specifies the order in which the deltas are to be applied. In general, an instance is not aware of deltas created at other instances until the deltas are received, so the presence of a record in the ingress table indicates receipt of the delta. In alternative embodiments, the ingress table includes a field such as "received" to indicate that the delta has been received. For large scale distributed databases, the ingress map of FIG. 14D is inefficient both in its use of disk space and in the time required to insert a very large number of records. Therefore, in some embodiments, an ingress map has a data structure similar to the one illustrated in FIG. 14C.

The ingress map in FIG. 14C specifies two dimensional rectangles of deltas, so each individual record identifies a very large set of deltas. In one dimension, each record in the ingress map specifies a start row 1420 and an end row 1422, which define a contiguous range of row identifiers. In a second dimension, the ingress map in FIG. 14C specifies a start sequence 1424 and an end sequence 1426, which define a contiguous range of sequence identifiers. In some embodiments, a delta is included in the sequence range if its sequence identifier is greater than or equal to the start sequence and less than or equal to the end sequence. In other embodiments, there is a strict inequality on the upper end, so that deltas are included only when the sequence identifier is strictly less than the end sequence. (The strict inequality could also be placed on the lower end.) In these latter embodiments, the start sequence 1424 of one record is equal to the end sequence of the previous record. In still other embodiments, records in the ingress table do not specify a start sequence 1424, on the assumption that the starting sequence for one record is the end sequence of the previous record. In some embodiments, the ingress table includes an identifier of the source instance. In other embodiments, there is a separate ingress table for each other instance, so the source instance need not be saved in the table.

An ingress map may be used in the compaction process to identify which deltas have been received from other instances. In some embodiments, the sets of row identifiers used in transmissions and compaction are the same, and are contiguous ranges that are reused. See FIGS. 15A-15B and the associated discussion below. Because the same start row 1420 and end row 1422 are reused, the compaction process can read the ingress records for these start and end rows and determine whether there are any sequence gaps. This is illustrated in FIG. 17.

FIGS. 15A and 15B illustrate a process for developing a plan to transmit deltas to other instances in an efficient manner according to some embodiments. In these embodiments, a range of row identifiers is selected, beginning with transmission start row 1504 and ending with transmission end row 1506. In some embodiments, the transmission start row 1504 and end row 1506 match the start row 1404 and end row 1406 used in the egress maps 1516-2 and 1516-3. In addition to the selection of row identifiers, the process determines the highest sequence identifier 1514 that has been used for any deltas at the first instance. At this point, all deltas within the transmission rectangle 1518 should be sent to the other instances.

Because many of the deltas have already been transmitted to other instances (and acknowledged as received), the actual transmission matrices (also known as Shapes to Send) are much smaller. The egress maps 1516-2 and 1516-3 identify which deltas have already been transmitted and acknowledged, so the deltas in each egress map are "subtracted" from the transmission rectangle 1518 to create the transmission matrices 1508-2 and 1508-3 for each of the other instances. As illustrated in FIG. 15A, the egress map 1516-3 includes individual egress records 1510-1, 1510-2, 1510-3, etc., which jointly identify the deltas already sent to instance 3 and acknowledged. The egress records are stored in an egress table 134 such as the one illustrated in FIG. 14A. Subtracting the individual egress records 1510-1, etc. from the transmission rectangle 1518 yields transmission matrix 1508-3.

The egress map 1516-2 to instance 2 is a little different in the illustration because there is a notch 1520 of deltas that have not been acknowledged as received at instance 2. This may occur, for example, when the start row 1504 and end row 1506 for the transmission do not match the start row 1404 and end row 1406 of records in the egress map. The transmission matrix 1508-2 for instance 2 is thus not a simple rectangle. The original transmission plan 1512-1 is to transmit matrix A 1508-2 to instance 2 and transmit matrix B 1508-3 to instance 3. In some instances, this transmission plan will be used. However, other transmission plans are contemplated, and the costs of the transmission plans are compared. In this context, "costs" come in many forms: the actual dollar cost for use of certain bandwidth, the opportunity cost for using bandwidth that could have been used for another process, the risk associated with network links (which could incur other costs to retransmit or resolve), the cost in time it takes to transmit deltas to other instances, etc.

To investigate other transmission plans, several set theoretic operations are performed on the transmission matrices A 1508-2 and B 1508-3. In some embodiments, the difference A−B 1508-4 and the difference B−A 1508-5 are computed. In the example illustrated in FIGS. 15A and 15B, A−B is a small transmission matrix C 1508-4, and B−A is the empty set 1508-5. In some embodiments, the intersection A∩B 1508-6 is computed, which in this case yields a large revised transmission matrix D. Transmission matrix C 1508-4 only needs to go to instance 2, but transmission matrix D 1508-6 needs to go to both instance 2 and instance 3. If the cost of transmitting data between instance 2 and instance 3 is lower than the cost of transmitting data from instance 1 to instance 3, then a good option is transmission plan 1512-2, which transmits the deltas for matrix D 1508-6 to instance 2, which incorporates the data and forwards the deltas for matrix D to instance 3. The deltas for matrix C 1508-4 are transmitted only to instance 2. A simple cost analysis example is illustrated in FIG. 16, described above.
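
Treating the transmission matrices as sets of deltas, the candidate plans follow from three set operations; a sketch, with A and B standing for the matrices bound for instances 2 and 3:

    def candidate_plans(A, B):
        """Split two transmission matrices into private and shared parts."""
        only_2 = A - B    # matrix C: deltas needed only by instance 2
        only_3 = B - A    # empty set in the example of FIGS. 15A-15B
        shared = A & B    # matrix D: needed by both, eligible for forwarding
        return only_2, only_3, shared

With these pieces, plan 1512-2 sends C and D to instance 2 and lets instance 2 forward D to instance 3, while plan 1512-1 simply sends A and B directly; which plan is chosen depends on the costs discussed above.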

Because the data in matrix D 1508-6 must go to both instance 2 and instance 3 in the illustration, an alternative transmission plan 1512-4 sends the deltas for matrix D 1508-6 to instance 3, which incorporates the deltas and transmits them to instance 2. This alternative transmission plan may be more cost effective if bandwidth directly from instance 1 to instance 2 is more costly than bandwidth from instance 1 to instance 3. In some embodiments, "over-transmission" is permitted, as illustrated in transmission plan 1512-3. In this transmission plan, transmission matrix A 1508-2 is sent to instance 2 (as needed), then transmitted to instance 3, even though it contains an extra portion of deltas that are already at the third instance. Generally, intentional over-transmission of deltas is undesirable, but if the over-transmission is small and there are sufficient other benefits to the transmission plan, it may be a good option.

FIG. 17 illustrates how ingress maps 1712-2, 1712-3, and 1712-4 at instance 1 may be used in compaction operation 1222. Ingress map 1712-2 identifies deltas received from instance 2, and so on. In some embodiments, the ingress maps all use the same ranges of row identifiers, as depicted by start row 1420 and end row 1422 in FIG. 17. In other embodiments, or under certain circumstances, different ranges may be used. In fact, different ranges may be used even within a single ingress map 136. Each rectangle in an ingress map, such as rectangles 1714-1, 1714-2, and 1714-3 in ingress map 1712-4, identifies a batch of deltas that was received. Typically, received batches arrive in order, as illustrated by ingress records 1714-1, 1714-2, and 1714-3. In some embodiments, the start sequence of one batch is the end sequence of the previous batch. In these embodiments, deltas are included in a batch if their sequence identifiers are strictly greater than the start sequence and less than or equal to the end sequence. In other embodiments, the ingress map table saves only the ending sequence, and each batch includes deltas that have sequence identifiers greater than the previous end sequence. In some rare circumstances there are gaps in the ingress map, as illustrated by gap 1704 for ingress map 1712-3 in FIG. 17. The gap 1704 shows a range of sequence identifiers that have not yet been received from instance 3.

To calculate a compaction horizon 610, the largest received sequence identifier for each instance is determined. For instance 2, the highest received sequence identifier is 1702-2, which is the end sequence of the most recent transmission from instance 2. For instance 4, the highest received sequence identifier is 1702-4, which is the end sequence of the most recent transmission from instance 4. For instance 3, the highest sequence identifier received is 1706, from the most recent transmission, but the gap 1704 prevents compaction beyond point 1702-3, which represents the highest usable sequence identifier. The sequence identifiers 1702-2, 1702-3, and 1702-4 identify the highest usable sequence identifiers for each individual instance, so the compaction horizon cannot be greater than any of these values. For example, there may be deltas at instance 2 with sequence identifiers greater than 1702-2, so the compaction horizon cannot be greater than the sequence identifier at 1702-2. Therefore, the compaction horizon is less than or equal to min(1702-2, 1702-3, 1702-4). In the example illustrated in FIG. 17, the minimum of these is 1702-2, so the compaction horizon is at most the sequence identifier at 1702-2. Of course, the compaction horizon is also limited based on what deltas have been transmitted from instance 1 to the other instances.

In some embodiments, a process analogous to the one just described for using ingress maps in the calculation of a compaction horizon also applies to the use of egress maps. This is operation 1220 in FIG. 12B. For each instance other than the current instance, a maximum sequence identifier is determined, and the compaction horizon is limited by each of these. This is similar to the compaction horizon being limited by the sequence identifiers 1702-2, 1702-3, and 1702-4 in the ingress maps.

In the embodiments just described, deltas with sequence identifiers less than or equal to the compaction horizon are merged with the corresponding base values. In alternative embodiments, the deltas are merged only when their sequence identifiers are strictly less than the compaction horizon. In these embodiments, the compaction horizon is selected slightly differently. Specifically, the compaction horizon is selected to be a sequence identifier S such that, for all S′ < S:

-   -   (a) Every delta for relevant entries with sequence identifier S′ has been transmitted to every other instance that potentially has an interest in these entries (and the other instances have acknowledged receipt of the deltas); and
    -   (b) There is certainty that no delta will ever arrive in the future for one of these relevant entries with sequence identifier S′. In particular, (1) no delta with such a sequence identifier will be created at the current instance, and (2) all deltas for the relevant entries with sequence identifier S′ have already been received locally and acknowledged.

The manner of ensuring these conditions depends on the implementation. In some embodiments, where sequence identifiers are assigned by a blobmaster 204, the compaction horizon S can be calculated using "first missing sequence identifiers" in the ingress maps 136 and egress maps 134. Some embodiments define a function called 'FirstMissingSequencer', which returns the least sequence identifier S that is not an element of an ingress or egress map. In this way, condition (a) is satisfied if S <= the first missing sequence identifier for each egress map. Condition (b)(2) is satisfied if S <= the first missing sequence identifier for each ingress map. And (b)(1) follows from (a) because the sequence identifiers generated at an instance are monotonically increasing. Therefore, the minimum of the various first missing sequence identifiers provides an exemplary compaction horizon. One of ordinary skill in the art would recognize that other embodiments could compute the compaction horizon differently.
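
A sketch of this calculation, treating each map as an explicit set of received or acknowledged sequence identifiers and, purely for readability, treating sequence identifiers as small integers starting at zero:

    def first_missing_sequencer(seen):
        """Least sequence identifier that is not an element of the map."""
        expected = 0
        for seq in sorted(seen):
            if seq > expected:
                break                 # gap found: 'expected' is missing
            expected = seq + 1
        return expected

    def compaction_horizon(egress_maps, ingress_maps):
        """Horizon = minimum first-missing identifier over all maps."""
        return min(first_missing_sequencer(m)
                   for m in list(egress_maps) + list(ingress_maps))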

FIGS. 18A-18E illustrate data structures that are used to store metadata in some embodiments. In some embodiments, these data structures exist within the memory space of an executing program or process. In other embodiments, these data structures exist in non-volatile memory, such as magnetic or optical disk drives. In some embodiments, these data structures form a protocol buffer, facilitating transfer of the structured data between physical devices or processes. See, for example, the Protocol Buffer Language Guide, available at http://code.google.com/apis/protocolbuffers/docs/proto.html.

The overall metadata structure 1802 includes three major parts: the data about blob generations 1804, the data about blob references 1808, and inline data 1812. In some embodiments, read tokens 1816 are also saved with the metadata, but the read tokens are used as a means to access data instead of representing characteristics of the stored blobs.

The blob generations 1804 can comprise one or more "generations" of each blob. In some embodiments, the stored blobs are immutable, and thus are not directly editable. Instead, a "change" of a blob is implemented as a deletion of the prior version and the creation of a new version. Each of these blob versions 1806-1, 1806-2, etc. is a generation, and has its own entry. In some embodiments, a fixed number of generations are stored before the oldest generations are physically removed from storage. In other embodiments, the number of generations saved is set by a blob policy 326. (A policy can set the number of saved generations to 1, meaning that the old generation is removed when a new generation is created.) In some embodiments, removal of old generations is intentionally "slow," providing an opportunity to recover an old "deleted" generation for some period of time. The specific metadata associated with each generation 1806 is described below with respect to FIG. 18B.

Blob references 1808 can comprise one or more individual references 1810-1, 1810-2, etc. Each reference is an independent link to the same underlying blob content, and each reference has its own set of access information. In most cases there is only one reference to a given blob. Multiple references can occur only if the user specifically requests them. This process is analogous to the creation of a link (a hard link) in a desktop file system. The information associated with each reference is described below with respect to FIG. 18C.

Inline data 1812 comprises one or more inline data items 1814-1, 1814-2, etc. Inline data is not "metadata"; it is the actual content of the saved blob to which the metadata applies. For blobs that are relatively small, access to the blobs can be optimized by storing the blob contents with the metadata. In this scenario, when a client asks to read the metadata, the blobmaster returns the actual blob contents rather than read tokens 1816 and information about where to find the blob contents. Because blobs are stored in the metadata table only when they are small, there is generally at most one inline data item 1814-1 for each blob. The information stored for each inline data item 1814 is described below in FIG. 18D.

As illustrated in the embodiment of FIG. 18B, each generation 1806 includes several pieces of information. In some embodiments, a generation number 1822 (or generation ID) uniquely identifies the generation. The generation number can be used by clients to specify a certain generation to access. In some embodiments, if a client does not specify a generation number, the blobmaster 204 will return information about the most current generation. In some embodiments, each generation tracks several points in time. Specifically, some embodiments track the time the generation was created (1824). Some embodiments track the time the blob was last accessed by a user (1826). In some embodiments, last access refers to end user access, and in other embodiments, last access includes administrative access as well. Some embodiments track the time the blob was last changed (1828). In some embodiments that track when the blob was last changed, changes apply only to metadata because the blob contents are immutable. Some embodiments provide a block flag 1830 that blocks access to the generation. In these embodiments, a blobmaster 204 would still allow access to certain users or clients who have the privilege of seeing blocked blob generations. Some embodiments provide a preserve flag 1832 that guarantees that the data in the generation is not removed. This may be used, for example, for data that is subject to a litigation hold or other order by a court. In addition to these individual pieces of data about a generation, a generation has one or more representations 1818. The individual representations 1820-1, 1820-2, etc. are described below with respect to FIG. 18E.

FIG. 18C illustrates a data structure to hold an individual reference according to some embodiments. Each reference 1810 includes a reference ID 1834 that uniquely identifies the reference. When a user 302 accesses a blob, the user application 308 must specify a reference ID in order to access the blob. In some embodiments, each reference has an owner 1836, which may be the user or process that created the reference. Each reference has its own access control list ("ACL"), which may specify who has access to the blob, and what those access rights are. For example, a group that has access to read the blob may be larger than the group that may edit or delete the blob. In some embodiments, removal of a reference is intentionally slow, in order to provide for recovery from mistakes. In some embodiments, this slow deletion of references is provided by tombstones. Tombstones may be implemented in several ways, including the specification of a tombstone time 1840, at which point the reference will be truly removed. In some embodiments, the tombstone time is 30 days after the reference is marked for removal. In some embodiments, certain users or accounts with special privileges can view or modify references that are already marked with a tombstone, and have the right to remove a tombstone (i.e., revive a blob).

In some embodiments, each reference has its own blob policy, which may be specified by a policy ID 1842. The blob policy specifies the number of copies of the blob, where the copies are located, what types of data stores to use for the blobs, etc. When there are multiple references, the applicable "policy" is the union of the relevant policies. For example, if one policy requests 2 copies, at least one of which is in Europe, and another requests 3 copies, at least one of which is in North America, then the minimal union policy is 3 copies, with at least one in Europe and at least one in North America. In some embodiments, individual references also have a block flag 1844 and a preserve flag 1846, which function the same way as the block and preserve flags 1830 and 1832 defined for each generation. In addition, a user or owner of a blob reference may specify additional information about a blob, which may include on disk information 1850 or in memory information 1848. A user may save any information about a blob in these fields.
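
A sketch of taking the union of two such policies, restricted to the two attributes used in the example above (total copy count and per-region minimums); real blob policies carry additional fields:

    def union_policy(p1, p2):
        """Minimal policy that satisfies both input policies.

        Each policy is a dict: {'copies': n, 'regions': {region: minimum}}.
        """
        regions = {}
        for policy in (p1, p2):
            for region, minimum in policy['regions'].items():
                regions[region] = max(regions.get(region, 0), minimum)
        copies = max(p1['copies'], p2['copies'], sum(regions.values()))
        return {'copies': copies, 'regions': regions}

    # 2 copies (one in Europe) union 3 copies (one in North America)
    # yields 3 copies, at least one in Europe and one in North America.
    print(union_policy({'copies': 2, 'regions': {'Europe': 1}},
                       {'copies': 3, 'regions': {'North America': 1}}))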

FIG. 18D illustrates inline data items 1814 according to some embodiments. Each inline data item 1814 is assigned to a specific generation, and thus includes a generation number 1822. The inline data item also specifies the representation type 1852, which, in combination with the generation number 1822, uniquely identifies a representation item 1820. (See FIG. 18E and the associated description below.) In embodiments that allow multiple inline chunks for one blob, the inline data item 1814 also specifies the chunk ID 1856. In some embodiments, the inline data item 1814 specifies the chunk offset 1854, which specifies the offset of the current chunk from the beginning of the blob. In some embodiments, the chunk offset is specified in bytes. In some embodiments, there is a Preload Flag 1858 that specifies whether the data on disk is preloaded into memory for faster access. The contents 1860 of the inline data item 1814 are stored with the other data elements.

FIG. 18E illustrates a data structure to store blob representations according to some embodiments. Representations are distinct views of the same physical data. For example, one representation of a digital image could be a high resolution photograph. A second representation of the same blob of data could be a small thumbnail image corresponding to the same photograph. Each representation data item 1820 specifies a representation type 1852, which would correspond to "high resolution photo" and "thumbnail image" in the above example. The Replica Information 1862 identifies where the blob has been replicated, including the list of storage references (i.e., which chunk stores have the chunks for the blob). In some embodiments, the Replica Information 1862 includes other auxiliary data needed to track the blobs and their chunks. Each representation data item also includes a collection of blob extents 1864, which specify the offset to each chunk within the blob, to allow reconstruction of the blob.

When a blob is initially created, it goes through several phases, and some embodiments track these phases in each representation data item 1820. In some embodiments, a finalization status field 1866 indicates when the blob is UPLOADING, when the blob is FINALIZING, and when the blob is FINALIZED. Most representation data items 1820 will have the FINALIZED status. In some embodiments, certain finalization data 1868 is stored during the finalization process.

FIG. 19 illustrates a process 1900 of utilizing tapes as a direct storage medium in a distributed storage system (1902). The method is implemented on one or more servers, each having one or more processors and memory (1904). Initially, a request is received (1906) to store a blob of data in a tape store. In some embodiments, these requests are limited to background replication because reading and writing to tape is a comparatively slow process. The request includes (1908) the contents of the blob to be stored. When the request is received, the contents of the blob are written (1910) to a tape store buffer. In some embodiments, a tape store buffer comprises non-volatile memory, but in other embodiments, a tape store buffer may comprise volatile memory or a combination of volatile and non-volatile memory.

The blobs in the tape store buffer are written (1912) to tape when a predefined condition is met. In some embodiments, the predefined condition is that the tape store buffer fills to a first threshold percentage of capacity (1914). In some embodiments, the predefined condition is that a predefined length of time has passed since the last time content was written from the tape store buffer to tape (1916). Some embodiments have a predefined condition that combines both percent of capacity and time (e.g., the buffer fills to a certain percent of capacity or a certain amount of time has elapsed).
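
The combined condition can be sketched as a single predicate; the threshold percentage and time limit here are assumed tunables, not values from the disclosure:

    import time

    def should_flush(buffered_bytes, capacity_bytes, last_flush_time,
                     fill_threshold=0.8, max_age_seconds=3600):
        """Flush the tape store buffer when it is full enough OR old enough."""
        full_enough = buffered_bytes >= fill_threshold * capacity_bytes
        old_enough = time.time() - last_flush_time >= max_age_seconds
        return full_enough or old_enough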

At some point in the future, a request is received (1918) from a client to read the blob of data from the tape store. In some embodiments, the request must come from background replication. When the read requests reach a certain threshold, the contents of the blob are read from tape. In some embodiments, the request threshold is based on the number of read requests. In some embodiments, the request threshold is based on the number of bytes in the read requests. In some embodiments, the request threshold is based on the amount of time elapsed since the first request, or the weighted average wait time for multiple requests (e.g., weighted by the size of the blob, or a priority level). In some embodiments, the request threshold includes a combination of the above (e.g., total requested bytes or maximum length of time).

The bytes that are read from tape are written to another tape store buffer (1922). The tape store buffer for reading data from tape may be the same buffer used for writing data to tape, or a partition of the same computer readable medium. In some embodiments, the two buffers are distinct, and may comprise distinct media. For example, in some embodiments, the media used for writing is more reliable than the media used for reading, because data loss during reading can be resolved by reading the data from tape again. Once the blob has been written to the tape store buffer, a message is sent to the client indicating that the blob is available for reading.

FIG. 20 illustrates a process 2000 for storing blobs of data that incorporates content-based de-duplication. The process is implemented (2002) on one or more servers, each having memory and one or more processors. Process 2000 receives (2004) a first blob, and splits (2006) the first blob into one or more chunks. A small blob typically consists of a single chunk, but a large blob may be split into a large number of chunks. For example, a one gigabyte blob may be split into roughly 100 individual chunks of 10 megabytes. In some embodiments, the splitting into chunks creates chunks of a fixed size (except for the last chunk, which holds the remaining bytes of the blob). In other embodiments, chunks are selected to optimize processing or storage. For example, chunks may be selected in order to match chunks that are already stored, to take advantage of the content-based de-duplication described here. For each chunk, the process 2000 computes (2008) a content fingerprint as a function of the chunk contents. In some embodiments, the content fingerprint is a fixed-length bit string. In some embodiments, the content fingerprint is a 128 bit hash value. In some embodiments, the content fingerprint is a 256 bit (or larger) hash value. In some embodiments, the content fingerprint is a cryptographic hash. Some embodiments use MD4, MD5, SHA-1, or SHA-2 hash functions to compute the content fingerprint.
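
A sketch of fixed-size chunking and fingerprinting, using SHA-256 (one member of the SHA-2 family named above) and the 10 megabyte chunk size from the example; the function name is illustrative:

    import hashlib

    CHUNK_SIZE = 10 * 1024 * 1024   # 10 megabytes, as in the example above

    def split_and_fingerprint(blob: bytes):
        """Yield (fingerprint, chunk) pairs for fixed-size chunks of a blob."""
        for offset in range(0, len(blob), CHUNK_SIZE):
            chunk = blob[offset:offset + CHUNK_SIZE]
            yield hashlib.sha256(chunk).hexdigest(), chunk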

The process 2000 stores (2010) the first chunks in a chunk store. The process 2000 also stores (2012) the content fingerprints of the first chunks in a store distinct from the chunk store. In some embodiments, the content fingerprints are stored with the metadata for each blob. In other embodiments, the bitpusher 210 stores content fingerprints in an index to facilitate lookup.

The process 2000 receives (2014) a second blob, and splits (2016) the second blob into one or more chunks. The process 2000 computes (2018) the content fingerprint for each of the second chunks. The process 2000 compares the content fingerprints for each of the second chunks to the previously saved content fingerprints.

For each second chunk whose content fingerprint matches (2020) a content fingerprint of a chunk that is already stored, the respective second chunk is not stored (2024); instead, the process 2000 stores (2022) a reference to the existing stored chunk with the matching content fingerprint.

For each second chunk whose content fingerprint does not match (2026) a content fingerprint of any chunk that is already stored, the process 2000 stores the respective second chunk in a chunk store.
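
The store-or-reference decision then reduces to a lookup keyed by fingerprint; in this sketch, chunk_store (with an assumed put method) and fingerprint_index stand in for the chunk stores and the separate fingerprint store, and split_and_fingerprint is the sketch above:

    def store_blob(blob, chunk_store, fingerprint_index):
        """Store a blob's chunks, reusing any chunk whose fingerprint matches."""
        manifest = []                               # chunk list for the blob metadata
        for fingerprint, chunk in split_and_fingerprint(blob):
            if fingerprint not in fingerprint_index:
                chunk_id = chunk_store.put(chunk)   # new content: store the chunk
                fingerprint_index[fingerprint] = chunk_id
            manifest.append(fingerprint_index[fingerprint])  # reference either way
        return manifest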

The process of content-based de-duplication is also described below with respect to FIG. 23. In general, chunks of one blob may match chunks from a blob saved earlier. However, it is also possible for two or more chunks within a single blob to have the same content (and thus have matching content fingerprints). In some embodiments, a chunk whose content fingerprint matches the content fingerprint of a chunk that has already been stored will not be stored, regardless of whether the earlier stored chunk is from the same blob or a different blob. In addition, chunks from different generations of the same blob may match. For example, a later generation of a file may be nearly identical to an earlier version, and thus there may be multiple chunks that are the same. Although the generations are conceptually different versions of the blob, the storage space may overlap if some of the chunks are the same across generations.

FIG. 21 illustrates a process 2100 of utilizing blob representations. The process 2100 is implemented (2102) on one or more servers, each having memory and one or more processors. The process receives (2104) a first representation of a blob having a specified representation type. In some embodiments, each blob has a default representation type if the representation type is not specified. In some embodiments, an empty string "" denotes the default representation type. The process 2100 stores (2106) the first representation of the blob. In addition, the process 2100 stores (2108) metadata for the blob, including a name of the blob, the representation type, and a storage location for the first representation of the blob.

The process 2100 later receives (2110) a request to create a second representation of the blob with a second representation type. In some embodiments, a client requests the second representation type using a remote procedure call. Rather than send the entire blob back to the client (over expensive network bandwidth) so that the client can build the second representation, the second representation is created (2112) with the second representation type at or near the data center where the first representation of the blob is stored. The process stores (2114) the second representation of the blob. The second representation of the blob does not necessarily use the same chunk store as the original representation of the blob. For example, if the second representation is a thumbnail version of a higher resolution first representation, the thumbnail may be stored as an inline chunk, whereas the first representation may be stored in a file system store or a BigTable store. When the second representation is created, the metadata for the blob is updated (2116) to indicate the presence of the second representation of the blob with the second representation type.

Subsequently, a client may request to read either representation of the blob. In particular, the process 2100 receives (2118) a request from the client for a copy of the blob, and the request includes a specified representation type. As noted above, some embodiments allow an empty string to specify the default representation. In these embodiments, to identify the non-default representation, the client must specify the appropriate representation type with a non-empty string. In response, the process 2100 retrieves (2120) either the first representation of the blob or the second representation of the blob. The retrieval corresponds (2120) to the representation type requested by the client. The process 2100 then returns (2122) the retrieved representation of the blob to the client.

The creation and retrieval of blob representations is also described below with respect to FIGS. 24A-24C.

FIG. 22 illustrates an exemplary process 2200 for reading a blob of data. This process is also described below with respect to FIG. 25. At a high level, this is a two-stage process: first, find the metadata; then, using the metadata, find the actual blob and retrieve it.

The process 2200 executes (2202) at a client on a computer with one or more processors and memory. The process 2200 receives (2204) a request from a user application 308 for a blob. The process 2200 locates (2206) an instance within the distributed storage system that is geographically close to the client. At this point there is no guarantee that the located instance has the requested blob, or even knows about the blob (i.e., has the metadata for the requested blob). The client contacts (2208) a blob access module (e.g., a blobmaster) at the located instance to request the metadata for the requested blob. The request includes (2208) user access credentials.

The client receives (2210) from the blob access module a collection of metadata for the requested blob, and a set of one or more read tokens. The metadata includes information that specifies which instances have copies of the blob. From this list of instances, the client selects (2212) an instance that has a copy of the requested blob. The client then contacts (2214) a data store module (e.g., a bitpusher 210) at the selected instance, and provides (2214) the data store module with the set of one or more read tokens. In some embodiments, the read tokens correspond to the chunks that comprise the selected blob. The read tokens indicate to the data store module that the client has been authorized to read the specified chunks. In some embodiments, the read tokens are chunk-specific, so a client cannot acquire read tokens for one blob and use them to access chunks for a different blob.

The client receives (2216) the content of the requested blob in one or more chunks, then assembles (2218) the one or more chunks to form the requested blob. For one-chunk blobs, assembly requires little work. The client then returns (2220) the blob to the user application.

Note that the process illustrated in FIG. 22 is the simple case. There are two common variations. First, if the blob is stored as inline chunks, and the chunks are stored at the instance that the client initially contacts, the blob access module just returns the blob contents to the client rather than returning read tokens. In some embodiments the blob access module returns the metadata as well, but in other embodiments, only the content is returned. This one-step process to retrieve inline chunks is one reason that retrieval from inline chunks is fast.

On the other hand, the blob access module (e.g., the blobmaster) may not have the metadata for the requested blob. In this case, the local instance passes the request on to a global instance that has the metadata for all of the blobs. As long as the requested blob does exist, and the end user has access rights, the global instance passes the metadata back to the original local instance, and from there back to the client. Once the client has the metadata, the process 2200 proceeds to select (2212) an instance with a copy of the blob.

FIG. 23 illustrates graphically an exemplary process to implement content-based de-duplication. Blob #1 (2302-1) is received first, and is split into three chunks 2304-1, 2304-2, and 2304-3. The process computes the content fingerprints 2306-1, 2306-2, and 2306-3 for these three chunks. All three of the chunks have distinct content fingerprints, so all three of the chunks are stored in chunk store 2312. In addition, metadata 2310-1 is stored in the metadata store 206, and identifies the three chunks that comprise the first blob.

Blob #2 (2302-2) is processed in the same way. Blob #2, however, is split into four chunks 2304-4, 2304-5, 2304-6, and 2304-7. The split into four chunks could be based on a selected fixed size for chunks, or on another chunking algorithm. For each of these four chunks, the process computes the associated content fingerprint 2306-4, 2306-5, 2306-6, and 2306-7. Content fingerprints 2306-4, 2306-6, and 2306-7 do not match the content fingerprints of any chunks that are already saved, so the corresponding chunks are saved into the chunk stores 2312. However, the content fingerprint 2306-5 matches (2308) content fingerprint 2306-3, so chunk 2304-5 has already been saved in the chunk stores as chunk 2304-3. Rather than save this chunk again, the metadata for blob #2 (2310-2) identifies the existing chunk (Chunk 1.3) as part of the blob contents.

This simple example illustrates several points. First, the source of the matching chunks is irrelevant; in this example, the second chunk of one blob matches the third chunk of another blob. Second, the process compares the content fingerprints, not the entire content of the blobs. While chunks may be relatively large (e.g., 16K bytes), some embodiments create content fingerprints that are small and fixed in size (e.g., 128 bits). Some embodiments of the disclosed distributed storage system utilize a hash function that virtually eliminates the risk of creating two identical content fingerprints from chunks with distinct content.
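As an illustrative calculation (not from the specification): by the birthday bound, the probability that any two of n chunks with distinct content receive the same 128-bit fingerprint is approximately

\[ P(\text{collision}) \approx \binom{n}{2} \cdot 2^{-128} \approx \frac{n^2}{2^{129}}, \]

so even for n = 10^12 stored chunks the collision probability is on the order of 10^-15, which is negligible compared with other failure modes of the system.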

FIGS. 24A-24C graphically illustrate a process of creating and using multiple representations of the same blob. In FIG. 24A, the client 310 writes (2408) a blob 2402 to chunk stores at an instance 102. Initially, the blob has a single representation 2404. Later, a user requests creation of a second representation 2406 of the same blob 2402. The request is transmitted (2410) to the instance 102 using a remote procedure call. A coprocessor module at or near instance 102 creates a second representation 2406 with the requested second representation type. In some embodiments, a request to create a new representation may include a request to receive a copy of the new representation once it is created. In that case, the second representation 2406 is transmitted (2410) back to the client for presentation in the user application 308.
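The representation-creation remote procedure call could look roughly like the following. This is a non-limiting sketch; the coprocessor, chunk_stores, and metadata interfaces shown are hypothetical.

```python
def create_representation(instance, blob_id: str, new_type: str, return_copy: bool = False):
    # The request is transmitted to the instance as a remote procedure call.
    blob = instance.chunk_stores.load(blob_id)
    # A coprocessor module at or near the instance creates the second
    # representation with the requested type (e.g., an image thumbnail).
    second_rep = instance.coprocessor.convert(blob, new_type)
    instance.chunk_stores.save(blob_id, second_rep, representation=new_type)
    instance.metadata.add_representation(blob_id, new_type)
    # Optionally transmit the new representation back to the requesting client.
    return second_rep if return_copy else None
```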

Once the second representation 2406 is created and saved, it can be retrieved (and replicated) like any other representation. Therefore, in the future, a user 302 may request (2412) a copy of the second representation 2406 of the blob 2402, and the second representation 2406 will be returned to the client. A more detailed description of reading a blob was presented above with respect to FIG. 22, and is described below with respect to FIG. 25.

FIG. 25 illustrates graphically an exemplary process flow for reading a blob. Initially, a user application 308 requests (2510) a blob from the client library 310. In some embodiments, the client library 310 contacts (2512) a load balancer 314 to identify an appropriate instance to call for metadata lookup. Once the load balancer 314 selects an instance 102-1 to contact, the load balancer will either forward (2514) the request to the selected instance 102-1, or return the identity of the selected instance to the client 310. In the latter case, client 310 would then call the selected instance 102-1 directly. In the simple case where the metadata for the requested blob is at the instance 102-1, the blobmaster 204-1 retrieves the relevant metadata from the metadata store 206-1 and returns (2516) the metadata to the client, along with one or more read tokens.

The client then contacts (2518) a load balancer 314, and provides (2518) the load balancer 314 with a list of instances that have the requested blob. Based on known loads and/or network traffic, the load balancer selects an instance 102-2 to provide the blob contents. FIG. 25 illustrates a case where the instance 102-2 is not the same as the instance 102-1 that provided the metadata. However, in many cases the source of the metadata and the source of the blob contents will be the same.

The load balancer 314 either forwards (2520) the blob content request to the instance 102-2, or returns the identity of the selected instance 102-2 to the client. In the latter case, the client then contacts the instance 102-2 directly. In some embodiments, requests for blob contents are directed to a bitpusher 210-2 at the instance 102-2. The bitpusher 210-2 retrieves the chunks for the requested blob from the appropriate chunk stores 2502-2, and returns (2522) the chunks to the client 310. The client assembles (2524) the one or more chunks to reconstruct the desired blob, then delivers (2526) the blob to the user application that made the original request.
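The end-to-end read flow of steps 2510-2526 could be sketched as follows. The load balancer, blobmaster, and bitpusher objects here are hypothetical stand-ins for the components named above; the selection policies are assumptions.

```python
def read_blob_via_load_balancer(load_balancer, blob_id: str, user) -> bytes:
    # Steps 2510-2514: the load balancer picks an instance for metadata lookup.
    meta_instance = load_balancer.select_for_metadata(blob_id)
    # Step 2516: the blobmaster returns the metadata plus read tokens.
    metadata, tokens = meta_instance.blobmaster.lookup(blob_id, user)
    # Steps 2518-2520: from the instances that have the blob, the load balancer
    # picks a content instance based on known loads and/or network traffic.
    content_instance = load_balancer.select_for_content(metadata.instances_with_copies)
    # Step 2522: the bitpusher at that instance returns the chunks.
    chunks = content_instance.bitpusher.read_chunks(tokens)
    # Steps 2524-2526: assemble the chunks and deliver the blob.
    return b"".join(chunks)
```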

In this illustrated example, the bitpusher 210-1 and chunk stores 2502-1 at the initial instance 102-1 were not contacted, and likewise the blobmaster 204-2 and metadata store 206-2 at the second instance 102-2 were not contacted. FIG. 25 illustrates the simple case of reading a blob, as noted above with respect to FIG. 22. To address inline data and the case where the initially contacted instance 102-1 does not have the metadata for the desired blob, refer to the discussion of FIG. 22.

FIG. 27 illustrates some basic blob policies that may be applied to blobs stored in embodiments of the disclosed distributed storage system 200. Policy 2702 is a typical policy for storing a blob on "disk." In this policy, the actual chunk store used depends on the size of the blob. Policies 2704 and 2706 represent policies that specify a combination of storage on disk and storage on tape, which are typically in different cities. Policies 2708 and 2710 demonstrate that policies can specify geographic information about where blobs are stored or not stored. Policy 2712 illustrates a policy that has a time component, so that the desired number of copies changes over time. Although not depicted in this figure, a blob policy may specify the quality of service (QOS) that will be used when replicating a blob over the network.
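As a non-limiting sketch, policies of the kinds in FIG. 27 might be expressed as declarative configuration; the field names below are illustrative assumptions, since the specification does not prescribe a concrete schema.

```python
# Hypothetical declarative encodings of the FIG. 27 policy families.
POLICIES = {
    # Policy 2702: store on disk; the chunk store used depends on blob size.
    "disk-only": {"copies": 2, "media": ["disk"]},
    # Policies 2704/2706: disk plus tape, typically in different cities.
    "disk-plus-tape": {"copies": 2, "media": ["disk", "tape"],
                       "separate_cities": True},
    # Policies 2708/2710: geographic constraints on where blobs may be stored.
    "geo-restricted": {"copies": 3, "media": ["disk"],
                       "allowed_regions": ["EU"], "forbidden_regions": ["US"]},
    # Policy 2712: a time component; the desired copy count changes over time.
    "age-out": {"media": ["disk"],
                "schedule": [{"age_days": 0, "copies": 3},
                             {"age_days": 30, "copies": 1}]},
}
```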

FIG. 27 also illustrates the relationship between policies and the blobs that use those policies. Blobs 2714, 2716, 2724, and 2726 each have a blob policy that applies to that blob alone. Although this is allowed, policies are rarely created for individual blobs. In general, a user application specifies a small number of blob policies (e.g., 3) that apply to all blobs created or used by that user application. The policies may apply to millions of blobs. This is illustrated by policy 2706, which applies to blobs 2718, 2720, . . . , 2722. Similarly, policy 2712 applies to all blobs between blob 2728 and blob 2730.

FIG. 28 illustrates how chunks and the associated metadata and indexes are stored according to some embodiments. Blob metadata 2802 indicates that the first generation of blob 1 is split into two chunks C1 and C2. Chunk C1 comprises bytes 0 to 1000 of the blob, and chunk C2 comprises bytes 1001 to 2596 of the blob. In some embodiments, this metadata is saved in metadata store 206 and accessed by blobmaster 204. The bitpusher 210 manages a chunk index that specifies where each of the chunks is located, and which blobs use each of the chunks. The chunk index portion 2806 corresponding to the first chunk C1 indicates that the first chunk is used in both the first generation of blob 1 and the first generation of blob 2. The actual contents 2812 of chunk C1 are in a chunk store. In particular, there is only one physical copy of the contents of chunk C1 even though there are two distinct blobs using this chunk.

The chunk index portion 2808 corresponding to chunk C2 indicates that it is used by the first generation of blob 1. The corresponding chunk contents 2814 of chunk C2 (bytes 1001 to 2596) are stored in a chunk store. The illustration in FIG. 28 also shows a second generation of blob 1, with metadata 2804. The second generation comprises a single chunk C3, which is different from both chunks C1 and C2. The corresponding chunk index portion 2810 indicates that this chunk is in use by the second generation of blob 1, and the contents 2816 of this chunk are stored in a chunk store.
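A minimal sketch of the FIG. 28 structures follows. The types and field names are illustrative assumptions; the actual layouts of the metadata store and chunk index are not prescribed here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ChunkRef:
    chunk_id: str
    byte_range: Tuple[int, int]   # e.g., C1 = bytes (0, 1000), C2 = bytes (1001, 2596)

@dataclass
class BlobGeneration:
    generation: int
    chunks: List[ChunkRef]        # e.g., generation 1 of blob 1 -> [C1, C2]

@dataclass
class ChunkIndexEntry:
    # Managed by the bitpusher: where the chunk is located, and which
    # blob generations use it.
    chunk_store_location: str
    used_by: List[Tuple[str, int]] = field(default_factory=list)
    # e.g., C1 -> [("blob1", 1), ("blob2", 1)]: one physical copy, two users.

metadata_store: Dict[str, List[BlobGeneration]] = {}   # blob name -> generations
chunk_index: Dict[str, ChunkIndexEntry] = {}           # chunk id -> index entry
```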

The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated.

1. A method of storing data for files, implemented on one or more servers, having memory and one or more processors storing one or more programs for execution by the one or more processors, the method comprising:
receiving a first blob of data;
splitting the first blob of data into one or more first chunks of data;
computing a content fingerprint for respective first chunks of data;
storing the first chunks of data in a chunk store;
storing the content fingerprints of the first chunks of data in a store distinct from the chunk store;
receiving a second blob of data;
splitting the second blob of data into one or more second chunks of data;
computing a content fingerprint for respective second chunks of data;
for a respective second chunk of data whose content fingerprint matches a content fingerprint of a first chunk of data: storing a second reference to the corresponding first chunk of data that has a matching content fingerprint; and not storing the second chunk of data; and
for each second chunk of data whose content fingerprint does not match a content fingerprint of a first chunk of data: storing the second chunk of data in a chunk store.
2. A method of storing data for files, implemented on one or more servers, having memory and one or more processors storing one or more programs for execution by the one or more processors, the method comprising:
receiving a first representation of a blob of data having a specified first representation type;
storing the first representation of the blob of data;
storing metadata for the blob of data, including a name of the blob, the representation type, and a storage location for the first representation of the blob;
receiving a request to create a second representation of the blob with a second representation type;
creating a second representation of the blob having the second representation type;
storing the second representation of the blob of data;
updating the metadata for the blob of data to indicate the presence of the second representation of the blob with the second representation type;
receiving a request from a client for a copy of the blob, wherein the request includes a specified representation type;
retrieving either the first representation of the blob or the second representation of the blob, the retrieved representation of the blob corresponding to the representation type requested by the client; and
sending the retrieved representation of the blob to the client.